@-1. TX 0 @General
|
||
|
|
||
|
@-2. T 0 @Screen control
|
||
|
|
||
|
@-2. X 0 @Screen
|
||
|
|
||
|
@-3. T 0 @Statistical analysis of content
|
||
|
|
||
|
@-3. X 0 @Statistics
|
||
|
|
||
|
@-4. T 0 @Structures and repeats
|
||
|
|
||
|
@-4. X 0 @Structures
|
||
|
|
||
|
@-5. TX 0 @Translation and codons
|
||
|
|
||
|
@-6. TX 0 @Gene search by content
|
||
|
|
||
|
@-7. TX 0 @General signals
|
||
|
|
||
|
@-8. TX 0 @Specific signals
|
||
|
|
||
|
@0. TX -1 @NIP
|
||
|
|
||
|
|
||
|
This is a program for analysing individual nucleotide
|
||
|
sequences. It can read sequences stored in many of the most commonly
|
||
|
used formats, and performs all of the usual simple analyses. However
|
||
|
the main purpose of the program is to provide methods for finding
|
||
|
the function of each section of a sequence. In general no single
|
||
|
method can give an unequivocal interpretation of a sequence so we
|
||
|
need to use many techniques together and to combine their results.
|
||
|
For this reason the program presents many of its results
|
||
|
graphically.
|
||
|
|
||
|
General information is contained in the user interface. Online
|
||
|
documentation for any function follows a consistent pattern: summary,
|
||
|
list of inputs, list of outputs, details, example.
|
||
|
@1. TX 0 @ Help
|
||
|
|
||
|
This option gives online help. The user should select option
|
||
|
numbers and the current documentation will be given. Note that
|
||
|
option 0 gives an introduction to the program, and that ? will get
|
||
|
help from anywhere in the program. The following functions are
|
||
|
included:
|
||
|
@2. TX 0 @ Quit
|
||
|
|
||
|
This function stops the program.
|
||
|
@3. TX 1 @ Read a new sequence
|
||
|
|
||
|
This option allows users to read in new sequences, browse
|
||
|
through annotations, or search sequence libraries for keywords.
|
||
|
Sequences can be read from "personal" sequence files or from
|
||
|
sequence libraries. These are referred to as the sequence "source".
|
||
|
Personal files can be stored in several formats: Staden, PIR, EMBL,
|
||
|
GENBANK and GCG. At LMB we use "Staden" format for sequencing and
|
||
|
all the libraries are stored in their original formats. Note,
|
||
|
however, that libraries such as EMBL or GenBank that are divided
|
||
|
into several files (e.g. GenBank has 13 separate files) are indexed as
|
||
|
a whole. This means that users do not need to know which file
|
||
|
contains an entry, only which library. When the user chooses to
|
||
|
read in a sequence the program first asks for the sequence "source".
|
||
|
|
||
|
If the user selects "personal" the program will ask for the
|
||
|
format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
|
||
|
the file. For PIR format the user will also be required to know the
|
||
|
entry name of the sequence as the file can contain several. For the
|
||
|
other formats only a single entry is expected. The file will be
|
||
|
read, its length and composition will be displayed, and the option
will be exited.
|
||
|
|
||
|
If the user selects "library" as the sequence source the
|
||
|
program will display a list of available libraries. The programs are
|
||
|
capable of handling all current libraries but which ones are
|
||
|
available will vary from site to site. At LMB we have several
|
||
|
libraries and also weekly updates of data gathered between releases.
|
||
|
The program will ask users to select a library and then give a list
|
||
|
of options:
|
||
|
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entrynames from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
|
||
|
If "Get a sequence" or "Get annotations" is selected, users will be asked
|
||
|
to type the entry name. The option will be left when a sequence is
|
||
|
selected or ! is typed. The composition and length will be
|
||
|
displayed.
|
||
|
|
||
|
The text index contains all words from feature tables,
|
||
|
reference titles, definition lines, keywords lists and comments, so
|
||
|
the text index search is the most useful. It is also the fastest. Up to
|
||
|
5 words can be searched for at once. The words should be typed
|
||
|
separated by spaces, for example
|
||
|
? Keywords=P53 mouse murine tumo
|
||
|
|
||
|
will search for all entries that contain words starting with p53,
|
||
|
mouse, murine and tumo. Only the unique entries that contain ALL
|
||
|
words will be listed. Before listing the matching entries the
|
||
|
program will show the number of 'hits' for each word and ring the
|
||
|
bell. Escape is possible at this point, or after each screenful of
|
||
|
entries. In addition to the entry names the text search displays
|
||
|
the primary accession number, the sequence length and up to 80
|
||
|
characters of description. (The search of 'titles' is now redundant
|
||
|
because the full text index contains all the title words and the
|
||
|
search is much faster. It will probably be removed from the
|
||
|
program.) All searches are independent of case. Where possible the
|
||
|
program will offer default entry names.
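
The matching rule described above (case-independent, words match by their leading characters, and only entries containing ALL words are listed) can be sketched as follows. This is an illustration only, not the program's own index code: the function name, the dictionary-based index and the sample entries are all assumptions made for the example.

```python
def search_text_index(index, keywords, max_words=5):
    """Return (hits per word, entry names containing ALL words).

    `index` is a hypothetical mapping of entry name -> list of words
    taken from that entry's annotation. Matching is by prefix and
    ignores case, as described in the text.
    """
    words = keywords.upper().split()
    if not words:
        return {}, []
    if len(words) > max_words:
        raise ValueError(f"at most {max_words} words can be searched at once")

    matches = {}
    for word in words:
        matches[word] = {
            name
            for name, entry_words in index.items()
            if any(w.upper().startswith(word) for w in entry_words)
        }

    hits = {word: len(names) for word, names in matches.items()}
    # Only entries that contain every word are reported.
    common = set.intersection(*matches.values())
    return hits, sorted(common)
```

The per-word hit counts are returned separately because the program reports them before listing the entries that contain every word.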
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
Select sequence source
|
||
|
X 1 Personal file
|
||
|
2 Sequence library
|
||
|
? Selection (1-2) (1) =
|
||
|
Select sequence file format
|
||
|
X 1 Staden
|
||
|
2 EMBL
|
||
|
3 GenBank
|
||
|
4 PIR
|
||
|
5 GCG
|
||
|
? Selection (1-5) (1) =
|
||
|
? Sequence file name=M13MP7.SEQ
|
||
|
Contig title removed
|
||
|
Sequence length= 7238
|
||
|
Sequence composition
|
||
|
T C A G -
|
||
|
2405. 1539. 1765. 1527. 2.
|
||
|
33.2% 21.3% 24.4% 21.1% 0.0%
|
||
|
.
|
||
|
.
|
||
|
.
|
||
|
|
||
|
|
||
|
Select sequence source
|
||
|
X 1 Personal file
|
||
|
2 Sequence library
|
||
|
? Selection (1-2) (1) =2
|
||
|
Select a library
|
||
|
X 1 EMBL 29 nucleotide library Dec 91
|
||
|
2 SWISSPROT 20 protein library Nov 91
|
||
|
3 PIR 31 protein library Dec 91
|
||
|
4 NRL3D 58 From Brookhaven protein library Dec 91
|
||
|
5 GenBank
|
||
|
? Selection (1-5) (1) =
|
||
|
Library is in EMBL format with indexes
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =5
|
||
|
Search for keywords
|
||
|
? Keywords=P53 mouse
|
||
|
P53 hits 68
|
||
|
MOUSE hits 8180
|
||
|
|
||
|
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
|
||
|
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
|
||
|
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
|
||
|
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
|
||
|
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
|
||
|
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
|
||
|
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
|
||
|
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
|
||
|
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
|
||
|
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
|
||
|
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
|
||
|
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
|
||
|
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
|
||
|
22 different entries found
|
||
|
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =4
|
||
|
Search for keywords
|
||
|
? Keywords=alpha
|
||
|
Searching for alpha
|
||
|
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
|
||
|
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
|
||
|
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
|
||
|
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
|
||
|
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
|
||
|
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
|
||
|
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
|
||
|
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
|
||
|
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
|
||
|
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
|
||
|
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
|
||
|
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
|
||
|
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
|
||
|
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
|
||
|
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
|
||
|
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
|
||
|
!
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =3
|
||
|
? Accession number=v00636
|
||
|
Entry name LAMBDA
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =2
|
||
|
Default Entry name=LAMBDA
|
||
|
? Entry name=
|
||
|
ID LAMBDA standard; DNA; PHG; 48502 BP.
|
||
|
XX
|
||
|
AC V00636; J02459; M17233; X00906;
|
||
|
XX
|
||
|
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
|
||
|
DT 09-JUN-1982 (Rel. 1, Created)
|
||
|
XX
|
||
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
||
|
XX
|
||
|
KW circular; coat protein; DNA binding protein; genome;
|
||
|
KW origin of replication.
|
||
|
XX
|
||
|
OS Bacteriophage lambda
|
||
|
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
|
||
|
XX
|
||
|
RN [1]
|
||
|
RP 1-48502
|
||
|
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
|
||
|
RT "Nucleotide sequence of bacteriophage lambda DNA";
|
||
|
RL J. Mol. Biol. 162:729-773(1982).
|
||
|
XX
|
||
|
!
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =
|
||
|
Default Entry name=LAMBDA
|
||
|
? Entry name=
|
||
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
||
|
Sequence length 48502
|
||
|
Sequence composition
|
||
|
T C A G -
|
||
|
11988. 11360. 12336. 12818. 0.
|
||
|
24.7% 23.4% 25.4% 26.4% 0.0%
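
The composition table shown in the dialogue can be reproduced with a few lines. This is a minimal sketch: the function name is hypothetical, and it assumes that the "-" column counts every symbol that is not T, C, A or G, which the text does not state explicitly.

```python
from collections import Counter


def composition(seq):
    """Return {symbol: (count, percent)} for T, C, A, G and '-' (everything else)."""
    counts = Counter(seq.upper())
    total = len(seq)
    table = {}
    known = 0
    for base in "TCAG":
        n = counts.get(base, 0)
        known += n
        table[base] = (n, 100.0 * n / total if total else 0.0)
    other = total - known  # assumption: '-' column holds all other symbols
    table["-"] = (other, 100.0 * other / total if total else 0.0)
    return table
```

The lambda figures above are consistent with this calculation: the four counts sum to 48502, and 11988 of 48502 is 24.7%.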
|
||
|
|
||
|
@4. TX 1 @ Define active region
|
||
|
|
||
|
For its analytic functions the program always works on a
|
||
|
region of the sequence called the "active region". This function
|
||
|
allows the start and end points of the active region to be reset.
|
||
|
|
||
|
Define the required start and end points.
|
||
|
|
||
|
When a new sequence is read into the program the active region
|
||
|
is automatically set to start at the beginning of the sequence and
|
||
|
extend to the maximum the program can handle. On most machines this
|
||
|
will be to the end of the sequence. The positions are shown on the
|
||
|
screen. Note that for convenience, in the listing and translation
|
||
|
functions, the user is given access to regions outside the active
|
||
|
region.
|
||
|
@5. TX 1 @ List a sequence
|
||
|
|
||
|
The sequence can be listed single or double stranded with line
|
||
|
lengths from 10 to 120 in multiples of 10.
|
||
|
|
||
|
Define the region to list, the line length required and choose
|
||
|
between a single or double stranded display. The output looks like:
|
||
|
|
||
|
GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
|
||
|
CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
|
||
|
10 20 30 40 50
|
||
|
|
||
|
AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
|
||
|
TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
|
||
|
60 70 80 90 100
|
||
|
|
||
|
GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
|
||
|
CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
|
||
|
110 120 130 140 150
|
||
|
|
||
|
AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
|
||
|
TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
|
||
|
160 170 180 190 200
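
A listing of this kind can be sketched as below: blocks of ten bases, an optional complementary strand underneath, and the position of the last base of each complete block on a third line. The function name and the exact column alignment are assumptions for illustration; the program's own spacing may differ.

```python
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def list_sequence(seq, line_length=50, double_stranded=False, start=1):
    """Return the listing as a list of text lines."""
    if line_length % 10 != 0 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    lines = []
    for offset in range(0, len(seq), line_length):
        chunk = seq[offset:offset + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append(" ".join(blocks))
        if double_stranded:
            # Second strand is the complement, written in the same direction.
            lines.append(" ".join(b.translate(COMPLEMENT) for b in blocks))
        # Number only the complete blocks of ten, right-aligned under each block.
        numbers = [start + offset + 10 * (k + 1) - 1 for k in range(len(chunk) // 10)]
        lines.append(" ".join(f"{n:>10}" for n in numbers))
    return lines
```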
|
||
|
|
||
|
@6. TX 1 @ List a text file.
|
||
|
|
||
|
Allows the user to have a text file displayed on the screen.
|
||
|
It will appear one page at a time.
|
||
|
|
||
|
Supply the name of the file to be displayed.
|
||
|
@7. TX 1 @ Direct output to disk
|
||
|
|
||
|
Used to direct output that would normally appear on the screen
|
||
|
to a file.
|
||
|
|
||
|
Select redirection of either text or graphics, and supply the
|
||
|
name of the file that the output should be written to.
|
||
|
|
||
|
The results from the next options selected will not appear on
|
||
|
the screen but will be written to the file. When option 7 is
|
||
|
selected again the file will be closed and output will again appear
|
||
|
on the screen.
|
||
|
@8. TX 1 @ Write active region to disk
|
||
|
|
||
|
Used to write the current active section of sequence to a disk
|
||
|
file in "Staden format".
|
||
|
|
||
|
Supply a file name and an optional title.
|
||
|
|
||
|
The program has the capability of reading sequences stored in
|
||
|
several formats and so, in conjunction with this option, can be used
|
||
|
to reformat them.
|
||
|
@9. TX 1 @ Edit the sequence
|
||
|
|
||
|
Used to edit sequences or any other files by giving access to
|
||
|
the computer's system editor. For editing sequences the input file
|
||
|
should have already been created using one of the listing functions
|
||
|
such as "list sequence", "list translation" or "list restriction
|
||
|
sites above the sequence".
|
||
|
|
||
|
Supply the name of the file to edit. Wait while the system
|
||
|
editor is made ready (this can take a while on a VAX). Use the editor.
|
||
|
Exit from the editor. If a sequence has been edited, and you want to
|
||
|
process it, affirm that the sequence should be "made active". The
|
||
|
edited sequence will replace the original sequence.
|
||
|
|
||
|
This editing method is designed to give users access to an
|
||
|
editor with which they are familiar - i.e. the one on their machine,
|
||
|
and yet to allow them to edit a sequence which contains all the
|
||
|
landmarks they need in order to know where they are. Users can
|
||
|
create files containing simple listings (single stranded) with
|
||
|
numbering, using "list the sequence", and then edit them with their
|
||
|
system editor, using the numbering to know where they are within the
|
||
|
sequence. When the edits are complete they exit from the editor and
|
||
|
the program "analyses" the edited file to extract only the sequence
|
||
|
characters. Similarly a file containing a three phase translation
|
||
|
can be edited, or a file containing a sequence plus its three phase
|
||
|
translation, plus its restriction sites marked above the sequence.
|
||
|
In order to be able to "analyse" such complicated listings and
|
||
|
correctly extract the sequence the following simple rule is used:
|
||
|
all lines in the file that contain a character that is not A,C,T,G
|
||
|
or U are deleted. It is obviously important to be aware of this rule
|
||
|
and its implications.
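
The extraction rule can be sketched as follows. This is an illustration under stated assumptions, not the program's code: the function name is hypothetical, and it assumes that spaces are ignored, that empty lines are skipped and that case does not matter.

```python
def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U; join what remains."""
    allowed = set("ACGTU")
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split()).upper()  # assumption: spaces are ignored
        if symbols and all(ch in allowed for ch in symbols):
            kept.append(symbols)
    return "".join(kept)
```

One implication of the rule is visible in the sketch: a translation line that happens to contain only the amino acid letters A, C, G and T would be kept as if it were sequence.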
|
||
|
@10. TX 2 @ Clear graphics
|
||
|
|
||
|
Clears graphics from the screen.
|
||
|
@11. TX 2 @ Clear text
|
||
|
|
||
|
Clears text from the screen.
|
||
|
@12. TX 2 @ Draw a ruler
|
||
|
|
||
|
This option allows the user to draw a ruler or scale along the
|
||
|
x axis of the screen to help identify the coordinates of points of
|
||
|
interest. The user can define the position of the first base to be
|
||
|
marked (for example if the active region is 1501 to 8000, the user
|
||
|
might wish to mark every 1000th base starting at either 1501 or 2000
|
||
|
- it depends if the user wishes to treat the active region as an
|
||
|
independent unit with its own numbering starting at its left edge,
|
||
|
or as part of the whole sequence). The user can also define the
|
||
|
separation of the ticks on the scale and their height. If required
|
||
|
the labelling routine can be used to add numbers to the ticks.
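
The choice of first marked base and tick separation can be illustrated with a small sketch. The function name and arguments are hypothetical; only the arithmetic follows the description above.

```python
def ruler_ticks(region_start, region_end, first_mark, separation):
    """Return the base positions at which ticks would be drawn."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [
        pos
        for pos in range(first_mark, region_end + 1, separation)
        if pos >= region_start
    ]
```

For the active region 1501 to 8000 with a separation of 1000, starting at 2000 gives ticks in whole-sequence numbering, while starting at 1501 gives ticks counted from the left edge of the active region.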
|
||
|
@13. TX 2 @ Use crosshair
|
||
|
|
||
|
This function puts a steerable cross on the screen that can be
|
||
|
used to find the coordinates of points in the sequence. The user can
|
||
|
move the cross around using the directional keys; when he hits the
|
||
|
space bar the program will print out the coordinates of the cross in
|
||
|
sequence units and the option will be exited.
|
||
|
|
||
|
If instead you hit a comma (,), the position will be displayed but the
|
||
|
cross will remain on the screen.
|
||
|
|
||
|
If a letter s is hit the program will display the sequence
|
||
|
around the crosshair position, and leave the cross on the screen.
|
||
|
@14. TX 2 @ Reposition plots
|
||
|
|
||
|
The position of each plot is defined relative to a
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
|
||
|
Plots for each option are drawn in a window defined by x0,y0 and
|
||
|
xlength,ylength, where x0,y0 is the position of the bottom left hand
|
||
|
corner of the window, and xlength is the width of the window and
|
||
|
ylength the height of the window.
|
||
|
---------------------------------------------------------  10,000
1                                                       1
1      --------------------------------------  ^        1
1      1                                    1  1        1
1      1                                    1  1        1
1      1                                    1  ylength  1
1      1                                    1  1        1
1      1                                    1  1        1
1      --------------------------------------  v        1
1 x0,y0^                                                1
1      <---------------xlength-------------->           1
---------------------------------------------------------  1
1                                                  10,000
|
||
|
|
||
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
||
|
The default window positions are read from a file "NIPMARG" when the
|
||
|
program is started. Users can have their own file if required. As
|
||
|
all the plots start at the same position in x and have the same
|
||
|
width, x0 and xlength are the same for all options. Generally users
|
||
|
will only want to change the start level of the window y0 and its
|
||
|
height ylength. This option allows users to change window positions
|
||
|
whilst running the program. The routine prompts first for the
|
||
|
number of the option that the user wishes to reposition; then for
|
||
|
the y start and height; then for the x start and length. Note that
|
||
|
changes to the x values affect all options. If the user types only
|
||
|
carriage return for any value it will remain unchanged. The cross-
|
||
|
hair can be used to choose suitable heights.
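
The window geometry can be made concrete with a short sketch that converts a sequence position and a plotted value into drawing board units. The class, its method and the linear scaling of plotted values are assumptions for illustration; the text defines only x0, y0, xlength, ylength and the 1-10,000 board.

```python
from dataclasses import dataclass

BOARD_MIN, BOARD_MAX = 1, 10_000


@dataclass
class Window:
    x0: int       # left edge, board units
    y0: int       # bottom edge, board units
    xlength: int  # width, board units
    ylength: int  # height, board units

    def __post_init__(self):
        if self.x0 < BOARD_MIN or self.y0 < BOARD_MIN:
            raise ValueError("window starts outside the drawing board")
        if self.x0 + self.xlength > BOARD_MAX or self.y0 + self.ylength > BOARD_MAX:
            raise ValueError("window extends outside the drawing board")

    def to_board(self, position, region_start, region_end, value, value_min, value_max):
        """Map (sequence position, plotted value) to board (x, y), assuming linear scales."""
        fx = (position - region_start) / (region_end - region_start)
        fy = (value - value_min) / (value_max - value_min)
        return (self.x0 + fx * self.xlength, self.y0 + fy * self.ylength)
```

Changing only y0 and ylength moves one plot up or down and rescales its height, which matches the typical use described above.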
|
||
|
@15. TX 2 @ Label a diagram
|
||
|
|
||
|
This routine allows users to label any diagrams they have
|
||
|
produced. They are asked to type in a label. When the user types
|
||
|
carriage return to finish typing the label the cross-hair appears on
|
||
|
the screen. The user can position it anywhere on the screen. If the
|
||
|
user types R (for right justify) the label will be written on the
|
||
|
diagram with its right end at the cross-hair position. If the user
|
||
|
types L (for left justify) the label will be written on the diagram
|
||
|
with its left end at the cross-hair position. The cross-hair will
then immediately reappear. The user may put the same label on
another part of the diagram in the same way or, on hitting the space
bar, will be asked whether to type in another label.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=15
|
||
|
Type label then drive cross hair to left or right end
|
||
|
of label position then hit "L" to write label left
|
||
|
justified or "R" to write label right justified or
|
||
|
the space bar to quit
|
||
|
|
||
|
|
||
|
? Label=delta gene
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? Label=
|
||
|
|
||
|
@16. TX 2 @Display a map
|
||
|
|
||
|
This draws a map of any sequence features selected by the
|
||
|
user. These features may be protein coding regions (CDS), tRNA
|
||
|
genes (TRNA), promoter positions (PRM), etc. Users may define their
|
||
|
own feature table key names. For example I find it convenient to
|
||
|
split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
|
||
|
those sequences that code in the reading frames 1, 2 or 3. Then I
|
||
|
can plot them at different heights on the screen (suitable heights
|
||
|
can be determined by using the cross-hair).
|
||
|
|
||
|
The coordinates must be stored in a file in the format of an
|
||
|
EMBL or GenBank feature table. Note that this means that the file
|
||
|
must include either EMBL or GenBank headers, and a suitable "tail".
|
||
|
The simplest header is the word FEATURES starting in column 1 of the
|
||
|
first line of the file. The simplest tail is 2 empty lines at the
|
||
|
end of the file. These lines are not included when nip writes out
|
||
|
results in feature table format.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=16
|
||
|
Display a map using an EMBL feature table file
|
||
|
? map file name=hsegl1.ft
|
||
|
? feature code(e.g. CDS) =CDS
|
||
|
X 1 + strand
|
||
|
2 - strand
|
||
|
3 both strands
|
||
|
? 0,1,2,3 =
|
||
|
? level (0-9480) (256) =4000
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? feature code(e.g. CDS) =
|
||
|
|
||
|
@17. TX 1 @ Search for restriction enzymes
|
||
|
|
||
|
This routine is used to search for short sequences, like
|
||
|
restriction enzyme recognition sequences, and can either list the
|
||
|
results or present them graphically. Listings can take several forms
|
||
|
and can include the sequence and its translation. Examples are given
|
||
|
below. The program will also display the names of enzymes that cut
|
||
|
the sequence infrequently. Users can select from sets of enzymes
|
||
|
stored in files or can enter them from the keyboard.
|
||
|
|
||
|
The short sequences (strings) and their names need to be
|
||
|
arranged in a particular way. See below. Select to search, list an
|
||
|
enzyme file or clear the screen. Choose either a file of enzymes or
|
||
|
to enter their recognition sequences at the keyboard. Choose to
|
||
|
search for all the enzymes in the list or to select from the list.
|
||
|
Select a mode of output. Define the sequence as circular or linear.
|
||
|
Select to search for "definite" or "possible" matches. The search
|
||
|
starts, and after the results have been displayed, further searches
|
||
|
can be performed.
|
||
|
|
||
|
When the enzymes and their recognition sequences are stored in
|
||
|
a file they must be defined in the following way. We call the
|
||
|
recognition sequences "strings". The format is as follows: each
|
||
|
string or set of strings must be preceded by a name, each string
|
||
|
must be preceded and terminated with a slash (/), and each set of
strings must be terminated by 2 slashes. For example AATII/GACGT'C// defines the name
|
||
|
AATII, its recognition sequence GACGTC and its cut site with the '
|
||
|
symbol; ACCI/GT'MKAC// defines the name ACCI, whose recognition
sequence includes IUB symbols for incompletely specified bases in
|
||
|
nucleic acid sequences; BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
|
||
|
defines the name BBVI and this time two recognition sequences and
|
||
|
cut sites are specified in order to correctly show the cutting
|
||
|
position relative to the recognition sequence. If no cut site is
|
||
|
included the first base of the recognition sequence is displayed as
|
||
|
being on the 3' side of the cut site.
|
||
|
|
||
|
These collections of strings and their names can be read from
|
||
|
disk or entered from the keyboard. When names and strings are
|
||
|
entered from the keyboard the program will ask for the name and then
|
||
|
the string(s). If more than one string is typed per name they must
|
||
|
be separated by slash (/) characters. See the "Typical dialogue"
|
||
|
below. Three files containing restriction enzyme recognition
|
||
|
sequences are currently available. The "all enzymes" file contains
|
||
|
the Rich Roberts REBASE restriction enzyme database, which is
|
||
|
updated monthly.
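
The entry format (a name, one or more strings separated by slashes, and a closing double slash) can be parsed with a short sketch like the one below. The function name and return shape are hypothetical, and treating a missing cut mark as a cut before the first base is an assumption about the intended behaviour.

```python
def parse_enzyme_entry(entry):
    """Parse e.g. AATII/GACGT'C// into (name, [(recognition sequence, cut offset), ...]).

    The cut offset is the number of bases of the recognition sequence
    that lie on the 5' side of the cut.
    """
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("entry must end with //")
    name, *strings = entry[:-2].split("/")
    if not name or not strings or any(not s for s in strings):
        raise ValueError("expected NAME/string/.../string//")
    sites = []
    for string in strings:
        cut = string.find("'")
        recognition = string.replace("'", "")
        # Assumption: with no ' mark, the cut is placed before the first base.
        sites.append((recognition, cut if cut >= 0 else 0))
    return name, sites
```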
|
||
|
|
||
|
The user can select strings by name from these collections. If
|
||
|
so the program will prompt for the names, one at a time. The user
|
||
|
can continue to select names until a blank name is entered (by the
|
||
|
user typing only return).
|
||
|
|
||
|
Listed output can be displayed in several ways: it can be
|
||
|
ordered enzyme by enzyme, or on cut positions, or with enzyme names
|
||
|
written above a listing of the sequence. This last listing can also
|
||
|
include a three phase translation of the sequence. In addition the
|
||
|
program will display only infrequent cutters (the user defines the
|
||
|
minimum number of cuts), or can plot the positions of matches.
|
||
|
|
||
|
Listings sorted "enzyme by enzyme" have the following form:
|
||
|
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AATII GACGT'C 112 111 111
|
||
|
912 912
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 ACCI GT'CGAC 112 111 111
|
||
|
2 ACCI GT'AGAC 420 308 308
|
||
|
604 604
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AHAII GA'CGTC 109 108 90
|
||
|
2 AHAII GG'CGTC 199 90 108
|
||
|
825 825
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AVAII G'GACC 84 83 51
|
||
|
2 AVAII G'GTCC 973 889 83
|
||
|
51 889
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BALI TGG'CCA 258 257 257
|
||
|
766 766
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BAMHI G'GATCC 92 91 91
|
||
|
|
||
|
...... etc
|
||
|
|
||
|
Listings sorted on cut position have the following form:
|
||
|
|
||
|
Searching
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 ECORI G'AATTC 2 1
|
||
|
2 BANI G'GTGCC 26 24
|
||
|
3 BSP1286 GTGCC'C 31 5
|
||
|
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
|
||
|
5 NSPBII CAG'CTG 51 13
|
||
|
6 PVUII CAG'CTG 51 0
|
||
|
7 BBVI GCAGCTGCTGGTG' 60 9
|
||
|
8 HINCII GTC'AAC 80 20
|
||
|
9 AVAII G'GACC 84 4
|
||
|
10 BINI 'CCAGGGATCC 87 3
|
||
|
11 BSTNI CC'AGG 89 2
|
||
|
12 BAMHI G'GATCC 92 3
|
||
|
13 XHOII G'GATCC 92 0
|
||
|
14 NSPBII CCG'CTG 98 6
|
||
|
15 BINI GGATCCGCT' 100 2
|
||
|
16 AHAII GA'CGTC 109 9
|
||
|
17 SALI G'TCGAC 111 2
|
||
|
18 AATII GACGT'C 112 1
|
||
|
19 ACCI GT'CGAC 112 0
|
||
|
20 HINCII GTC'GAC 113 1
|
||
|
21 BBVI GCAGCGACTGATT' 166 53
|
||
|
22 BINI 'ACTCAGATCC 178 12
|
||
|
23 XHOII A'GATCC 183 5
|
||
|
24 HGAI 'GGCGGCGGAGGCGTC 188 5
|
||
|
|
||
|
.....etc
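
A listing sorted on cut position could be produced along the following lines. This is a minimal sketch, not the program's algorithm: the function name is hypothetical, only the strand as given is scanned, and the reported position is taken to be the base immediately 3' of the cut, which is consistent with the listings shown here.

```python
# IUB (IUPAC) nucleotide codes and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def find_sites(seq, enzymes):
    """Return (position, name, matched string) tuples sorted by position.

    `enzymes` is a list of (name, string) pairs, where the string may
    contain IUB codes and a ' marking the cut site.
    """
    seq = seq.upper().replace("U", "T")
    results = []
    for name, string in enzymes:
        cut = max(string.find("'"), 0)  # no ' mark: cut before first base
        pattern = string.replace("'", "").upper()
        for start in range(len(seq) - len(pattern) + 1):
            window = seq[start:start + len(pattern)]
            if all(base in IUB[code] for base, code in zip(window, pattern)):
                matched = window[:cut] + "'" + window[cut:]
                results.append((start + cut + 1, name, matched))
    results.sort()
    return results
```

Applied to the first 60 bases of the example sequence, the sketch places ECORI at 2, BANI at 26 and PVUII at 51, as in the listing.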
|
||
|
|
||
|
Lists of infrequent cutters have the following form:
|
||
|
|
||
|
0 AFLII
|
||
|
0 AFLIII
|
||
|
0 APAI
|
||
|
0 APALI
|
||
|
0 ASUII
|
||
|
0 AVAI
|
||
|
0 AVRII
|
||
|
0 BCLI
|
||
|
0 BGLI
|
||
|
0 BGLII
|
||
|
0 BSMI
|
||
|
0 BSPMII
|
||
|
0 BSTEII
|
||
|
...... etc
|
||
|
|
||
|
Listings showing names above the sequence, and a translation have the
|
||
|
following form:
|
||
|
|
||
|
|
||
|
ECORI BANI BSP1286
|
||
|
. . . BBVI NSPBII
|
||
|
. . . . PVUII BBVI
|
||
|
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
|
||
|
10 20 30 40 50 60
|
||
|
E F G L G L V * G A Q R L L R R S C W C
|
||
|
N S V W A W C E V P R D Y C A A A A G A
|
||
|
I R F G L G V R C P E I T A P Q L L V L
|
||
|
|
||
|
HINCII
|
||
|
. AVAII
|
||
|
. . BINI
|
||
|
. . . BSTNI
|
||
|
. . . . BAMHI
|
||
|
. . . . XHOII NSPBII
|
||
|
. . . . . . BINI AHAII
|
||
|
. . . . . . . . SALI
|
||
|
. . . . . . . . .AATII
|
||
|
. . . . . . . . .ACCI
|
||
|
. . . . . . . . ..HINCII
|
||
|
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
|
||
|
70 80 90 100 110 120
|
||
|
W R C G G R Q R T Q G S A G R G R R R R
|
||
|
G G A E V V N G P R D P L D E D V D D E
|
||
|
A V R R S S T D P G I R W T R T S T T R
|
||
|
|
||
|
BBVI BINI
|
||
|
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
|
||
|
130 140 150 160 170 180
|
||
|
G G G G * R I A G G W Q R L I * V L T T
|
||
|
E E V D S A L L V A G S D * F E F * P L
|
||
|
R R W I A H C W W L A A T D L S S D H S
|
||
|
|
||
|
XHOII
|
||
|
. HGAI AHAII PFIMI
|
||
|
. . . . BBVI
|
||
|
CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
|
||
|
190 200 210 220 230 240
|
||
|
Q I R R R R R R G S R N S Q W L A C * I
|
||
|
R S G G G G V E A P E T P S G W P A R F
|
||
|
D P A A E A S R L P K L P V A G L L D S
|
||
|
|
||
|
.........etc
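
The three translation lines under each row of sequence can be reproduced with a short sketch. It assumes the standard genetic code and uses hypothetical names; incomplete codons at the end of the sequence are simply left untranslated.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_three_phases(seq):
    """Return the translations starting at bases 1, 2 and 3 of the sequence."""
    seq = seq.upper().replace("U", "T")
    frames = []
    for phase in range(3):
        codons = (seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
        # '?' marks codons containing symbols other than T, C, A, G.
        frames.append("".join(CODON_TABLE.get(codon, "?") for codon in codons))
    return frames
```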
|
||
|
|
||
|
|
||
|
The terms "possible" and "definite" matches are important only
|
||
|
for back translations of protein into DNA, which include IUB
redundancy codes. The matches that the program terms "definite
matches" are ones in which the specification of the recognition
|
||
|
sequence corresponds exactly to that of the back translation, and
|
||
|
consequently are definitely in the DNA sequence. The program will
|
||
|
also find what it terms 'possible matches' which are ones that
|
||
|
depend on the particular codons chosen for each amino acid. These
|
||
|
are sites at which recognition sequences could be engineered to
|
||
|
produce a cut in the DNA without changing the amino acid, but which
|
||
|
are not necessarily found in the original sequence.
|
||
|
|
||
|
The routine will handle both linear and circular sequences,
|
||
|
and so finds cutsites spanning the "ends" of circular sequences.
|
||
|
The program will only find cutsites spanning the ends of sequences
|
||
|
if the sequence is declared as circular. This includes sites for
|
||
|
recognition sequences containing leading or trailing N symbols, in
|
||
|
which the actual recognition sequence does not span the join. For
|
||
|
example if the recognition sequence was 'NNNNACGT and the first 4
|
||
|
characters in the sequence were ACGT, then the match would only be
|
||
|
found if the sequence was declared as circular. If the sequence is
|
||
|
linear then the first fragment starts at base number 1, and the last
|
||
|
ends at the last base. If the sequence is circular then the length
|
||
|
of the first fragment is the clockwise distance from the last cut to
|
||
|
the first.
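
The fragment length rules for linear and circular sequences can be sketched as follows. The function name is hypothetical, and cut positions are taken to be the number of the first base after each cut, which is consistent with the listings in this section.

```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Return fragment lengths in order along the sequence.

    Each cut position is the 1-based number of the first base after the cut.
    """
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    head = cuts[0] - 1                # base 1 up to the first cut
    tail = seq_length - cuts[-1] + 1  # last cut up to the last base
    if circular:
        # First fragment runs clockwise from the last cut round to the first.
        return [head + tail] + inner
    return [head] + inner + [tail]
```

With a sequence length of 1023 (inferred from the fragment sums in the listings), ACCI cuts at 112 and 420 give fragments of 111, 308 and 604 for a linear sequence.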
|
||
|
|
||
|
Graphical output marks the position of each string by a short
|
||
|
vertical line and gives the name of the enzyme at the left end of
|
||
|
the line. If the top of the screen is reached the program gives the
|
||
|
user the opportunity to take a hard copy and then will clear the
|
||
|
screen and restart plotting results at the original start position.
|
||
|
|
||
|
Below is an edited piece of dialogue from use of the search
|
||
|
option:
|
||
|
? Menu or option number=17
|
||
|
|
||
|
Search for restriction enzyme sites
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 = 2
|
||
|
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =
|
||
|
|
||
|
AATII/GACGT'C//
|
||
|
ACCI/GT'MKAC//
|
||
|
AFLII/C'TTAAG//
|
||
|
AFLIII/A'CRYGT//
|
||
|
AHAII/GR'CGYC//
|
||
|
APAI/GGGCC'C//
|
||
|
APALI/G'TGCAC//
|
||
|
ASUII/TT'CGAA//
|
||
|
AVAI/C'YCGRG//
|
||
|
AVAII/G'GWCC//
|
||
|
AVRII/C'CTAGG//
|
||
|
BALI/TGG'CCA//
|
||
|
BAMHI/G'GATCC//
|
||
|
BANI/G'GYRCC//
|
||
|
BANII/GRGCY'C//
|
||
|
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
|
||
|
BCLI/T'GATCA//
|
||
|
BGLI/GCCNNNN'NGGC//
|
||
|
BGLII/A'GATCT//
|
||
|
BINI/GGATCNNNN'/'NNNNNGATCC//
|
||
|
BSMI/GAATGCN'/NG'CATTC//
|
||
|
BSP1286/GDGCH'C//
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =
|
||
|
? (y/n) (y) Search for all names
|
||
|
X 1 Order results enzyme by enzyme
|
||
|
2 Order results by position
|
||
|
3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 =
|
||
|
? (y/n) (y) List matches
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
|
||
|
Searching
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AATII GACGT'C 112 111 111
|
||
|
912 912
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 ACCI GT'CGAC 112 111 111
|
||
|
2 ACCI GT'AGAC 420 308 308
|
||
|
604 604
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AHAII GA'CGTC 109 108 90
|
||
|
2 AHAII GG'CGTC 199 90 108
|
||
|
825 825
|
||
|
Matches found= 2
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 AVAII G'GACC 84 83 51
|
||
|
2 AVAII G'GTCC 973 889 83
|
||
|
51 889
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BALI TGG'CCA 258 257 257
|
||
|
766 766
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BAMHI G'GATCC 92 91 91
|
||
|
932 932
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BANI G'GTGCC 26 25 25
|
||
|
998 998
|
||
|
Matches found= 1
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BANII GAGCC'C 490 489 489
|
||
|
534 534
|
||
|
Matches found= 11
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 BBVI 'TACTGCGCCGCAGCTGC 38 37 3
|
||
|
2 BBVI GCAGCTGCTGGTG' 60 22 22
|
||
|
3 BBVI GCAGCGACTGATT' 166 106 28
|
||
|
4 BBVI 'CCTGCTAGATTCGCTGC 230 64 37
|
||
|
5 BBVI GCAGCGGTACGTA' 452 222 50
|
||
|
6 BBVI 'CTCGCCAACGTTGCTGC 502 50 55
|
||
|
7 BBVI GCAGCCTTCAACT' 606 104 64
|
||
|
8 BBVI 'GAGGTATTCCTGGCTGC 634 28 97
|
||
|
9 BBVI 'CTGGCCGCCGCCGCTGC 869 235 104
|
||
|
10 BBVI 'GCCGCCGCCGCTGCTGC 872 3 106
|
||
|
11 BBVI GCAGCGATGAGGA' 927 55 222
|
||
|
|
||
|
....etc
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =
|
||
|
|
||
|
? (y/n) (y) Search for all names
|
||
|
|
||
|
X 1 Order results enzyme by enzyme
|
||
|
2 Order results by position
|
||
|
3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 = 2
|
||
|
|
||
|
? (y/n) (y) List matches
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
|
||
|
Searching
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 ECORI G'AATTC 2 1
|
||
|
2 BANI G'GTGCC 26 24
|
||
|
3 BSP1286 GTGCC'C 31 5
|
||
|
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
|
||
|
5 NSPBII CAG'CTG 51 13
|
||
|
6 PVUII CAG'CTG 51 0
|
||
|
7 BBVI GCAGCTGCTGGTG' 60 9
|
||
|
8 HINCII GTC'AAC 80 20
|
||
|
9 AVAII G'GACC 84 4
|
||
|
10 BINI 'CCAGGGATCC 87 3
|
||
|
11 BSTNI CC'AGG 89 2
|
||
|
12 BAMHI G'GATCC 92 3
|
||
|
13 XHOII G'GATCC 92 0
|
||
|
14 NSPBII CCG'CTG 98 6
|
||
|
15 BINI GGATCCGCT' 100 2
|
||
|
16 AHAII GA'CGTC 109 9
|
||
|
17 SALI G'TCGAC 111 2
|
||
|
18 AATII GACGT'C 112 1
|
||
|
19 ACCI GT'CGAC 112 0
|
||
|
20 HINCII GTC'GAC 113 1
|
||
|
|
||
|
.....etc
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =
|
||
|
|
||
|
? (y/n) (y) Search for all names
|
||
|
|
||
|
1 Order results enzyme by enzyme
|
||
|
X 2 Order results by position
|
||
|
3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 =3
|
||
|
? Maximum number of cuts (0-100) (0) =
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
|
||
|
Searching
|
||
|
0 AFLII
|
||
|
0 AFLIII
|
||
|
0 APAI
|
||
|
0 APALI
|
||
|
0 ASUII
|
||
|
0 AVAI
|
||
|
0 AVRII
|
||
|
0 BCLI
|
||
|
0 BGLI
|
||
|
0 BGLII
|
||
|
0 BSMI
|
||
|
0 BSPMII
|
||
|
0 BSTEII
|
||
|
0 CLAI
|
||
|
0 DRAI
|
||
|
0 DRAII
|
||
|
0 ECOB
|
||
|
0 ECOK
|
||
|
0 ECORV
|
||
|
0 ESPI
|
||
|
|
||
|
......etc
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =
|
||
|
|
||
|
? (y/n) (y) Search for all names
|
||
|
|
||
|
1 Order results enzyme by enzyme
|
||
|
2 Order results by position
|
||
|
X 3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 =4
|
||
|
? (y/n) (y) Hide translation n
|
||
|
? (y/n) (y) Use 1 letter codes
|
||
|
? Line length (30-90) (60) =
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
|
||
|
Searching
|
||
|
ECORI BANI BSP1286
|
||
|
. . . BBVI NSPBII
|
||
|
. . . . PVUII BBVI
|
||
|
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTG
|
||
|
GTGC
|
||
|
10 20 30 40 50 60
|
||
|
E F G L G L V * G A Q R L L R R S C W C
|
||
|
N S V W A W C E V P R D Y C A A A A G A
|
||
|
I R F G L G V R C P E I T A P Q L L V L
|
||
|
|
||
|
HINCII
|
||
|
. AVAII
|
||
|
. . BINI
|
||
|
. . . BSTNI
|
||
|
. . . . BAMHI
|
||
|
. . . . XHOII NSPBII
|
||
|
. . . . . . BINI AHAII
|
||
|
. . . . . . . . SALI
|
||
|
. . . . . . . . .AATII
|
||
|
. . . . . . . . .ACCI
|
||
|
. . . . . . . . ..HINCII
|
||
|
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACG
|
||
|
ACGA
|
||
|
70 80 90 100 110 120
|
||
|
W R C G G R Q R T Q G S A G R G R R R R
|
||
|
G G A E V V N G P R D P L D E D V D D E
|
||
|
A V R R S S T D P G I R W T R T S T T R
|
||
|
|
||
|
BBVI BINI
|
||
|
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGAC
|
||
|
CACT
|
||
|
130 140 150 160 170 180
|
||
|
G G G G * R I A G G W Q R L I * V L T T
|
||
|
E E V D S A L L V A G S D * F E F * P L
|
||
|
R R W I A H C W W L A A T D L S S D H S
|
||
|
|
||
|
.......etc
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =5
|
||
|
Define search strings by typing a string name
|
||
|
followed by the string(s)
|
||
|
? Name=FRED
|
||
|
? String(s)=AAAAAA/TTTTTT
|
||
|
? Name=MARY
|
||
|
? String(s)=CCCC/GGGG/GCGCT
|
||
|
? Name=
|
||
|
? (y/n) (y) Search for all names
|
||
|
X 1 Order results enzyme by enzyme
|
||
|
2 Order results by position
|
||
|
3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 =
|
||
|
? (y/n) (y) List matches
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
Searching
|
||
|
Matches found= 9
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 FRED 'TTTTTT 1557 1556 1
|
||
|
2 FRED 'TTTTTT 1558 1 1
|
||
|
3 FRED 'TTTTTT 1559 1 1
|
||
|
4 FRED 'TTTTTT 1560 1 22
|
||
|
5 FRED 'AAAAAA 1582 22 529
|
||
|
6 FRED 'AAAAAA 3160 1578 1019
|
||
|
7 FRED 'AAAAAA 4204 1044 1044
|
||
|
8 FRED 'AAAAAA 5691 1487 1487
|
||
|
9 FRED 'AAAAAA 6710 1019 1556
|
||
|
529 1578
|
||
|
Matches found= 36
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 MARY 'CCCC 47 46 1
|
||
|
2 MARY 'GGGG 486 439 1
|
||
|
3 MARY 'GGGG 487 1 1
|
||
|
4 MARY 'CCCC 557 70 1
|
||
|
5 MARY 'CCCC 558 1 1
|
||
|
6 MARY 'GCGCT 1177 619 1
|
||
|
|
||
|
... etc
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =
|
||
|
1 All enzymes
|
||
|
X 2 Six cutters
|
||
|
3 Four cutters
|
||
|
4 Personal file
|
||
|
5 Keyboard
|
||
|
? 0,1,2,3,4,5 =5
|
||
|
Define search strings by typing a string name
|
||
|
followed by the string(s)
|
||
|
? Name=JANE
|
||
|
? String(s)=A'TTTT/CC'GGG
|
||
|
? Name=
|
||
|
? (y/n) (y) Search for all names
|
||
|
X 1 Order results enzyme by enzyme
|
||
|
2 Order results by position
|
||
|
3 Show only infrequent cutters
|
||
|
4 Show names above the sequence
|
||
|
? 0,1,2,3,4 =
|
||
|
? (y/n) (y) List matches
|
||
|
? (y/n) (y) The sequence is linear
|
||
|
? (y/n) (y) Search for definite matches
|
||
|
Searching
|
||
|
Matches found= 30
|
||
|
Name Sequence Position Fragment lengths
|
||
|
1 JANE A'TTTT 437 436 6
|
||
|
2 JANE A'TTTT 546 109 33
|
||
|
3 JANE A'TTTT 597 51 43
|
||
|
4 JANE A'TTTT 777 180 51
|
||
|
5 JANE A'TTTT 1274 497 60
|
||
|
6 JANE A'TTTT 1571 297 62
|
||
|
7 JANE CC'GGG 1926 355 75
|
||
|
8 JANE A'TTTT 2403 477 81
|
||
|
9 JANE A'TTTT 2586 183 82
|
||
|
10 JANE A'TTTT 2731 145 101
|
||
|
11 JANE A'TTTT 2812 81 103
|
||
|
|
||
|
... etc
|
||
|
|
||
|
|
||
|
X 1 Search
|
||
|
2 List enzyme file
|
||
|
3 Clear text
|
||
|
4 Clear graphics
|
||
|
? 0,1,2,3,4 =!
|
||
|
@18. TX 1 7 @ Compare a short sequence
|
||
|
|
||
|
This routine slides a short sequence along the current
|
||
|
sequence and finds all positions at which a given percentage of the
|
||
|
bases match. Output is in both graphical and listed forms.
|
||
|
|
||
|
If users call for dialogue when the routine is selected they
|
||
|
will be given the choice of keyboard or file input. Define the
|
||
|
string, select the "sense" to use and the percentage match. Matches
|
||
|
will be plotted out and then the user can select to have them
|
||
|
listed. Then the routine cycles around.
|
||
|
|
||
|
The routine slides the search string along the sequence and
|
||
|
marks the positions at which a minimum percentage score is reached.
|
||
|
The graphical output draws a vertical line at the match position;
|
||
|
the height of the line represents the percentage score, so that if
|
||
|
the line reaches the top of the box the score is 100%. The NC-IUB
|
||
|
symbols may be used in the search sequence to encode uncertain
|
||
|
characters. Any other symbols will not match.
|
||
|
|
||
|
|
||
|
NC-IUB SYMBOLS
|
||
|
|
||
|
A,C,G,T
|
||
|
R (A,G) 'puRine'
|
||
|
Y (T,C) 'pYrimidine'
|
||
|
W (A,T) 'Weak'
|
||
|
S (C,G) 'Strong'
|
||
|
M (A,C) 'aMino'
|
||
|
K (G,T) 'Keto'
|
||
|
H (A,T,C) 'not G'
|
||
|
B (G,C,T) 'not A'
|
||
|
V (G,A,C) 'not T'
|
||
|
D (G,A,T) 'not C'
|
||
|
N (G,A,C,T) 'aNy'
|
||
|
|
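A minimal sketch of the sliding comparison, using the symbol table
above. The function names are hypothetical, and two points are
assumptions rather than documented behaviour: a symbol in the search
string is taken to match a sequence base if the base belongs to the
set the symbol stands for, and the threshold is applied as "percentage
greater than or equal to the value given" (the program's own rounding
of the threshold may differ).

```python
# Sets of bases represented by each NC-IUB symbol (from the table above).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}

# Complement of each symbol, used to search with the other "sense".
COMPLEMENT = str.maketrans("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN")


def reverse_complement(string):
    """Return the search string for the opposite sense."""
    return string.translate(COMPLEMENT)[::-1]


def count_matches(probe, window):
    """Count positions where the sequence base is allowed by the probe symbol."""
    # Symbols not in the table match nothing.
    return sum(1 for p, b in zip(probe, window) if b in IUB.get(p, ""))


def percent_matches(sequence, probe, min_percent):
    """Slide probe along sequence; return (1-based position, score) pairs."""
    n = len(probe)
    hits = []
    for start in range(len(sequence) - n + 1):
        score = count_matches(probe, sequence[start:start + n])
        if 100.0 * score / n >= min_percent:
            hits.append((start + 1, score))
    return hits


if __name__ == "__main__":
    print(reverse_complement("AAATTTCCC"))
    print(percent_matches("TTACATTTCGCTT", "AAATTTCCC", 70.0))
```

The scores returned are numbers of matching characters, as in the
"Scores" line of the dialogue below.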
||
|
Typical dialogue is shown below.
|
||
|
|
||
|
|
||
|
? Menu or option number=18
|
||
|
Find percentage matches
|
||
|
? (y/n) (y) Keep picture
|
||
|
? String=AAATTTCCC
|
||
|
STRING=AAATTTCCC
|
||
|
? (y/n) (y) This sense
|
||
|
? Percent match (1.00-100.00) (70.00) =
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
Total scoring positions above 70.000 percent = 7
|
||
|
Scores 7 6 6 6 6 6 6
|
||
|
Positions 365 212 213 292 311 358 627
|
||
|
? Display (0-7) (0) =3
|
||
|
|
||
|
365
|
||
|
ACATTTCGC
|
||
|
* ***** *
|
||
|
AAATTTCCC
|
||
|
1
|
||
|
|
||
|
212
|
||
|
GAAACTCCC
|
||
|
** ****
|
||
|
AAATTTCCC
|
||
|
1
|
||
|
|
||
|
213
|
||
|
AAACTCCCA
|
||
|
*** * **
|
||
|
AAATTTCCC
|
||
|
1
|
||
|
? (y/n) (y) Keep picture
|
||
|
Default String=AAATTTCCC
|
||
|
? String=
|
||
|
STRING=AAATTTCCC
|
||
|
? (y/n) (y) This sense n
|
||
|
STRING=GGGAAATTT
|
||
|
? Percent match (1.00-100.00) (70.00) =
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
Total scoring positions above 70.000 percent = 7
|
||
|
Scores 6 6 6 6 6 6 6
|
||
|
Positions 269 270 271 288 354 624 853
|
||
|
? Display (0-7) (0) =3
|
||
|
|
||
|
269
|
||
|
GAGGGATTT
|
||
|
* * ****
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
|
||
|
270
|
||
|
AGGGATTTT
|
||
|
** * ***
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
|
||
|
271
|
||
|
GGGATTTTC
|
||
|
**** **
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
? (y/n) (y) Keep picture !
|
||
|
|
||
|
@19. TX 7 @ Compare a short sequence using a score matrix
|
||
|
|
||
|
This routine slides a short sequence along the current
|
||
|
sequence and finds all positions at which a given level of
|
||
|
similarity (a cutoff score) is reached. The score is defined by use
|
||
|
of a score matrix. Output is in both graphical and listed forms.
|
||
|
|
||
|
If users call for dialogue when the routine is selected they
|
||
|
will be given the choice of keyboard or file input. Define the
|
||
|
string, select the "sense" to use and the cutoff score. Matches will
|
||
|
be plotted out and then the user can select to have them listed.
|
||
|
Then the routine cycles around.
|
||
|
|
||
|
The routine slides the search string along the sequence and
|
||
|
marks the positions at which the cutoff score is achieved. The
|
||
|
graphical output draws a vertical line at the match position; the
|
||
|
height of the line represents the score, so that if the line
|
||
|
reaches the top of the box the score is the maximum possible. The
|
||
|
NC-IUB symbols may be used in the search sequence to encode
|
||
|
uncertain characters.
|
||
|
|
||
|
The score matrix reflects the level of redundancy in the probe
|
||
|
sequence and hence will put more emphasis on those characters that
|
||
|
are better defined. The score matrix is:
|
||
|
DNA SCORE MATRIX USING IUB SYMBOLS
|
||
|
|
||
|
T C A G - R Y W S M K H B V D N ?
|
||
|
|
||
|
T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0
|
||
|
C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0
|
||
|
A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0
|
||
|
G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0
|
||
|
- 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
|
||
|
R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0
|
||
|
Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0
|
||
|
W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0
|
||
|
S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0
|
||
|
M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0
|
||
|
K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0
|
||
|
H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0
|
||
|
B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0
|
||
|
V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0
|
||
|
D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0
|
||
|
N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
|
||
|
? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
|
||
|
|
||
|
? is any unrecognised character.
|
||
|
|
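As an illustration of how the matrix is applied, the sketch below
copies the table above into a lookup and sums the pair scores over
each window. The function names are hypothetical, and the maximum
score is assumed to be the score of the search string against itself
(36 per character, giving 324 for a 9 character string as in the
dialogue below).

```python
SYMBOLS = "TCAG-RYWSMKHBVDN?"

# The score matrix exactly as listed above, one row per symbol.
ROWS = """
T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0
C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0
A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0
G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0
- 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0
Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0
W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0
S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0
M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0
K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0
H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0
B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0
V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0
D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0
N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""

MATRIX = {}
for line in ROWS.strip().splitlines():
    fields = line.split()
    MATRIX[fields[0]] = dict(zip(SYMBOLS, map(int, fields[1:])))


def pair_score(a, b):
    """Score for one pair of symbols; unrecognised symbols score 0."""
    return MATRIX.get(a, MATRIX["?"]).get(b, 0)


def window_score(probe, window):
    """Sum of pair scores for the probe laid over one window."""
    return sum(pair_score(p, b) for p, b in zip(probe, window))


def max_score(probe):
    """Assumed maximum: the probe scored against itself."""
    return window_score(probe, probe)


def matrix_matches(sequence, probe, cutoff):
    """Return (1-based position, score) for windows reaching the cutoff."""
    n = len(probe)
    hits = []
    for start in range(len(sequence) - n + 1):
        score = window_score(probe, sequence[start:start + n])
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits


if __name__ == "__main__":
    print(max_score("AAATTTCCC"))
    print(window_score("AAATTTCCC", "ACATTTCGC"))
```

With an unambiguous search string each exact match contributes 36, so
seven matching characters out of nine give 252.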
||
|
Typical dialogue is shown below.
|
||
|
|
||
|
? Menu or option number=19
|
||
|
Find matches using a score matrix
|
||
|
? (y/n) (y) Keep picture
|
||
|
? String=AAATTTCCC
|
||
|
STRING=AAATTTCCC
|
||
|
? (y/n) (y) This sense
|
||
|
Minimum score= 0 Maximum score= 324
|
||
|
? Score (0-324) (280) =250
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
For score 250 the number of matches= 1
|
||
|
Scores 252
|
||
|
Positions 365
|
||
|
? Display (0-1) (0) =1
|
||
|
|
||
|
365
|
||
|
ACATTTCGC
|
||
|
* ***** *
|
||
|
AAATTTCCC
|
||
|
1
|
||
|
? (y/n) (y) Keep picture
|
||
|
Default String=AAATTTCCC
|
||
|
? String=
|
||
|
STRING=AAATTTCCC
|
||
|
? (y/n) (y) This sense n
|
||
|
STRING=GGGAAATTT
|
||
|
Minimum score= 0 Maximum score= 324
|
||
|
? Score (0-324) (222) = 200
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
For score 200 the number of matches= 7
|
||
|
Scores 216 216 216 216 216 216 216
|
||
|
Positions 269 270 271 288 354 624 853
|
||
|
? Display (0-7) (0) =3
|
||
|
|
||
|
269
|
||
|
GAGGGATTT
|
||
|
* * ****
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
|
||
|
270
|
||
|
AGGGATTTT
|
||
|
** * ***
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
|
||
|
271
|
||
|
GGGATTTTC
|
||
|
**** **
|
||
|
GGGAAATTT
|
||
|
1
|
||
|
? (y/n) (y) Keep picture !
|
||
|
|
||
|
@20. TX 7 @ Search for a motif using a weight matrix
|
||
|
|
||
|
This function performs searches for short sequence motifs
|
||
|
using an appropriate weight matrix. In addition it can be used to
|
||
|
create or modify weight matrices. In order to perform a search the
|
||
|
only input required is the name of the file containing the weight
|
||
|
matrix. The results can be presented graphically or listed. The
|
||
|
graphical presentation will draw a line at the position of any matches
|
||
|
found; the height of the line is proportional to the score.
|
||
|
|
||
|
For a search, select "use weight matrix", supply the name of
|
||
|
the file containing the weight matrix, and choose between having
|
||
|
results plotted or listed. If dialogue is requested when the
|
||
|
function is selected users can alter the cutoff score employed.
|
||
|
|
||
|
To create a weight matrix several steps are involved. A file
|
||
|
containing an alignment of known motifs is required. (This file must
|
||
|
be created before the current option is selected. The format is as
|
||
|
follows: each sequence is written on a separate line with at least
|
||
|
one space at the beginning; each sequence is terminated by a space
|
||
|
character, and can be followed by a name. The sequences must be
|
||
|
aligned.) Supply the name of the file of aligned sequences. The
|
||
|
program reads and displays the sequences. Choose between "summing
|
||
|
logs of weights" or summing weights (i.e. whether to multiply or add
|
||
|
weights). If logs are used all scores will be negative. Choose if
|
||
|
all positions in the set of aligned sequences should be used or if a
|
||
|
mask should be applied. If so selected, define a mask as a string of
|
||
|
symbols, in which symbol - means ignore and any other symbol means
|
||
|
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
|
||
|
|
||
|
The program will calculate weights as the frequencies of each
|
||
|
base at each unmasked position in the set of aligned sequences.
|
||
|
These weights are then applied to the set of aligned sequences to
|
||
|
give a range of "observed" scores. The mean and standard deviation
|
||
|
of these scores are displayed. The user is asked to supply several
|
||
|
values to be used when the weight matrix is applied to other
|
||
|
sequences: a cutoff score (by default, the mean minus 3 standard
|
||
|
deviations), a top score for scaling graphical results (by default,
|
||
|
the mean plus 3 standard deviations), and a position to identify
|
||
|
(this means that if a particular base within the motif is used as a
|
||
|
"landmark", such as the A of the AG in splice acceptor sites, then
|
||
|
its position will be marked in plots). All these values are stored
|
||
|
along with the weight matrix. Finally supply the name of a file to
|
||
|
contain the weight matrix.
|
||
|
|
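The steps described above can be sketched as follows, using the
aligned GCN4 sequences shown in the dialogue further on. This is an
illustration only: the function names are hypothetical, the weights
are taken to be the raw base counts at each unmasked position, scores
are obtained by summing weights (not logs), and the standard deviation
is the population form. Under those assumptions the sketch gives the
mean and standard deviation shown in the "Rescale" part of the
dialogue.

```python
from statistics import mean, pstdev

# Aligned sequences as listed in the dialogue (file GCN4.SEQ).
ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]


def count_matrix(seqs, mask=None):
    """Count each base at each position; '-' in the mask means ignore."""
    length = len(seqs[0])
    counts = {base: [0] * length for base in "TCAG"}
    for i in range(length):
        if mask is not None and mask[i] == "-":
            continue  # masked positions keep zero counts
        for seq in seqs:
            counts[seq[i]][i] += 1
    return counts


def sum_score(counts, window):
    """Score a window by adding the weight of its base at each position."""
    return sum(counts[b][i] for i, b in enumerate(window) if b in counts)


# Ignore the first four positions, use the remaining fifteen.
MASK = "----" + "x" * 15
counts = count_matrix(ALIGNED, MASK)
scores = [sum_score(counts, seq) for seq in ALIGNED]

score_mean = mean(scores)
score_sd = pstdev(scores)           # population standard deviation
cutoff = score_mean - 3 * score_sd  # default cutoff score
top = score_mean + 3 * score_sd     # default top score for plots

if __name__ == "__main__":
    print(f"Mean {score_mean:.3f}  Standard deviation {score_sd:.3f}")
    print(f"Mean minus 3.sd {cutoff:.3f}  Mean plus 3.sd {top:.3f}")
```

The same `count_matrix` and `sum_score` can then be slid along any
other sequence, keeping windows whose score reaches the cutoff.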
||
|
Weight matrices can be "rescaled" using a set of aligned
|
||
|
sequences in much the same way as a matrix is created. The purpose
|
||
|
is to redefine the cutoff scores, and rescaling does not alter any
|
||
|
other values in the weight matrix file.
|
||
|
|
||
|
The methods have changed considerably but were first outlined
|
||
|
in Staden, R. Nucl. Acids Res. 12 505-519 1984, and Staden, R.
|
||
|
Genetic engineering: principles and methods vol 7, Edited by J.K.
|
||
|
Setlow and A. Hollaender, Plenum publishing corp., 1985.
|
||
|
|
||
|
The methods have always had to deal with the problem of zeroes
|
||
|
in the matrices. The current versions employ "Laplace's Law of
|
||
|
Succession" in which 1 is added to each term.
|
||
|
|
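A sketch of why the correction matters when logs of weights are
summed: a zero count would otherwise give log(0). The counts below are
those stored in the weight matrix file listed at the end of the
dialogue. The function name is hypothetical, and the normalisation is
an assumption: each weight is taken as (count + 1) / (N + 5), where N
is the number of aligned sequences and 5 is a guessed number of symbol
classes. This choice is not stated in the documentation; it was picked
because it reproduces the first and third scores listed under
"Applying weights to input sequences" in the dialogue, and the program
may treat masked columns differently when searching.

```python
import math

# Base counts per position, copied from the weight matrix file listing.
COUNTS = {
    "T": [0, 0, 0, 0, 16, 0, 0, 1, 16, 0, 5, 11, 10, 12, 9, 6, 7, 12, 6],
    "C": [0, 0, 0, 0, 0, 0, 0, 15, 0, 15, 0, 3, 2, 2, 4, 3, 2, 1, 3],
    "A": [0, 0, 0, 0, 0, 0, 16, 0, 0, 1, 10, 0, 3, 2, 0, 3, 5, 2, 2],
    "G": [0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 1, 2, 1, 0, 3, 4, 2, 1, 5],
}
N_SEQS = 16  # number of aligned sequences (the N row of the file)


def log_score(window, counts=COUNTS, n_seqs=N_SEQS, n_symbols=5):
    """Sum natural logs of Laplace-corrected weights over the window.

    n_symbols=5 is an assumption made to match the listed scores.
    """
    total = 0.0
    for i, base in enumerate(window):
        # Adding 1 to every count keeps the weight above zero.
        weight = (counts[base][i] + 1) / (n_seqs + n_symbols)
        total += math.log(weight)
    return total


if __name__ == "__main__":
    print(f"{log_score('CGGATGACTCTTTTTTTTT'):.3f}")
    print(f"{log_score('AGCGTGACTCTTCCCGGAA'):.3f}")
```

Because every weight is less than 1, every log is negative, which is
why all scores are negative when logs are used.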
||
|
It is now possible to apply a mask to a set of aligned
|
||
|
sequences in order to give weight to selected positions only.
|
||
|
Sequences have superimposed functions: some parts may be of general
|
||
|
structural importance and give rise to an overall framework, and
|
||
|
other parts give specificity and hence are not common; we may want
|
||
|
to use a set of aligned sequences to define a motif, but want to use
|
||
|
only the framework positions. Alternatively we may want to pick out
|
||
|
only those parts of a set of aligned sequences that give a
|
||
|
particular property, and to ignore other similarities that are due
|
||
|
to some other property and which could obscure the pattern we are
|
||
|
interested in. The ability to define a mask allows certain positions
|
||
|
to be used in the motif and others to be ignored, and yet still
|
||
|
permits the use of a set of aligned sequences to calculate weights.
|
||
|
|
||
|
Typical dialogue is shown below.
|
||
|
|
||
|
? Menu or option number=20
|
||
|
X 1 Use weight matrix
|
||
|
2 Make weight matrix
|
||
|
3 Rescale weight matrix
|
||
|
? 0,1,2,3 =2
|
||
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
||
|
|
||
|
1 AGCGTGACTCTTCCCGGAA HIS1
|
||
|
2 GAGGTGACTCACTTGGAAG HIS1
|
||
|
3 CGGATGACTCTTTTTTTTT HIS3
|
||
|
4 ACAGTGACTCACGTTTTTT HIS4
|
||
|
5 GTCGTGACTCATATGCTTT ARG3
|
||
|
6 TGAATGACTCACTTTTTGG ARG4
|
||
|
7 TTCTTGACTCGTCTTTTCT CPA1
|
||
|
8 CGAATGACTCTTATTGATG CPA2
|
||
|
9 AGAATGACTAATTTTACTA TRP5
|
||
|
10 TCGTTGACTCATTCTAATC TRP3
|
||
|
11 TTGCTGACTCATTACGATT TRP2
|
||
|
12 GAGATGACTCTTTTTCTTT IV1
|
||
|
13 GCGATGATTCATTTCTCTG IV2
|
||
|
14 TAGATGACTCAGTTTAGTC LEU1
|
||
|
15 TAAGTGACTCAGTTCTTTC LEU4
|
||
|
16 ATGATGACTCTTAAGCATG ILS1
|
||
|
Length of motif 19
|
||
|
? (y/n) (y) Sum logs of weights
|
||
|
|
||
|
? (y/n) (y) Use all motif positions n
|
||
|
x means use, - means ignore
|
||
|
e.g. xx-x---x-x means use positions 1,2,4,8,10
|
||
|
? Mask=----XXXXXXXX
|
||
|
Applying weights to input sequences
|
||
|
1 -27.979 AGCGTGACTCTTCCCGGAA
|
||
|
2 -24.543 GAGGTGACTCACTTGGAAG
|
||
|
3 -20.890 CGGATGACTCTTTTTTTTT
|
||
|
4 -23.087 ACAGTGACTCACGTTTTTT
|
||
|
5 -22.771 GTCGTGACTCATATGCTTT
|
||
|
6 -23.408 TGAATGACTCACTTTTTGG
|
||
|
7 -25.159 TTCTTGACTCGTCTTTTCT
|
||
|
8 -22.679 CGAATGACTCTTATTGATG
|
||
|
9 -24.751 AGAATGACTAATTTTACTA
|
||
|
10 -23.157 TCGTTGACTCATTCTAATC
|
||
|
11 -23.067 TTGCTGACTCATTACGATT
|
||
|
12 -21.449 GAGATGACTCTTTTTCTTT
|
||
|
13 -24.191 GCGATGATTCATTTCTCTG
|
||
|
14 -23.770 TAGATGACTCAGTTTAGTC
|
||
|
15 -22.923 TAAGTGACTCAGTTCTTTC
|
||
|
16 -25.285 ATGATGACTCTTAAGCATG
|
||
|
Top score -20.890 Bottom score -27.979
|
||
|
Mean -23.694 Standard deviation 1.613
|
||
|
Mean minus 3.sd -28.534 Mean plus 3.sd -18.854
|
||
|
? Cutoff score (-999.00-9999.00) (-28.53) =
|
||
|
? Top score for scaling plots (-28.53-999.00) (-18.85) =
|
||
|
? Position to identify (0-19) (1) =
|
||
|
? Title=GCN4 SEQUENCES
|
||
|
? Name for new weight matrix file=1.WTS
|
||
|
|
||
|
|
||
|
? Menu or option number=20
|
||
|
X 1 Use weight matrix
|
||
|
2 Make weight matrix
|
||
|
3 Rescale weight matrix
|
||
|
? 0,1,2,3 =3
|
||
|
? Name of existing weight matrix file=1.WTS
|
||
|
GCN4 SEQUENCES
|
||
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
||
|
Length of motif 19
|
||
|
? (y/n) (y) Sum logs of weights n
|
||
|
? (y/n) (y) Use all motif positions
|
||
|
|
||
|
Applying weights to input sequences
|
||
|
1 128.000 AGCGTGACTCTTCCCGGAA
|
||
|
2 148.000 GAGGTGACTCACTTGGAAG
|
||
|
3 172.000 CGGATGACTCTTTTTTTTT
|
||
|
4 160.000 ACAGTGACTCACGTTTTTT
|
||
|
5 161.000 GTCGTGACTCATATGCTTT
|
||
|
6 157.000 TGAATGACTCACTTTTTGG
|
||
|
7 149.000 TTCTTGACTCGTCTTTTCT
|
||
|
8 160.000 CGAATGACTCTTATTGATG
|
||
|
9 151.000 AGAATGACTAATTTTACTA
|
||
|
10 159.000 TCGTTGACTCATTCTAATC
|
||
|
11 158.000 TTGCTGACTCATTACGATT
|
||
|
12 169.000 GAGATGACTCTTTTTCTTT
|
||
|
13 152.000 GCGATGATTCATTTCTCTG
|
||
|
14 157.000 TAGATGACTCAGTTTAGTC
|
||
|
15 160.000 TAAGTGACTCAGTTCTTTC
|
||
|
16 143.000 ATGATGACTCTTAAGCATG
|
||
|
Top score 172.000 Bottom score 128.000
|
||
|
Mean 155.250 Standard deviation 10.034
|
||
|
Mean minus 3.sd 125.147 Mean plus 3.sd 185.353
|
||
|
? Cutoff score (-999.00-9999.00) (125.15) =
|
||
|
? Top score for scaling plots (125.15-999.00) (185.35) =
|
||
|
? Position to identify (0-19) (1) =
|
||
|
? Title=GCN4 SEQUENCES
|
||
|
? Name for new weight matrix file=2.WTS
|
||
|
|
||
|
|
||
|
? Menu or option number=20
|
||
|
X 1 Use weight matrix
|
||
|
2 Make weight matrix
|
||
|
3 Rescale weight matrix
|
||
|
? 0,1,2,3 =
|
||
|
? Motif weight matrix file=1.WTS
|
||
|
GCN4 SEQUENCES
|
||
|
? (y/n) (y) Plot results n
|
||
|
|
||
|
153 -22.61 GCAGCGACTGATTTGAGTT
|
||
|
169 -28.53 GTTCTGACCACTCAGATCC
|
||
|
172 -27.27 CTGACCACTCAGATCCGGC
|
||
|
219 -27.35 CCAGTGGCTGGCCTGCTAG
|
||
|
268 -27.82 CGAGGGATTTTCGATCTTG
|
||
|
274 -26.99 ATTTTCGATCTTGTGGATG
|
||
|
283 -25.79 CTTGTGGATGATTTTCACG
|
||
|
287 -27.50 TGGATGATTTTCACGTGCG
|
||
|
298 -28.17 CACGTGCGCCGTCATATTG
|
||
|
332 -28.27 TCTTTGAAGCAGAAGGGAC
|
||
|
351 -28.27 AGGGGTACACTTTCACATT
|
||
|
357 -25.05 ACACTTTCACATTTCGCTT
|
||
|
364 -28.51 CACATTTCGCTTATGGGAG
|
||
|
400 -23.77 GAAGTTACTAATGTGCGTG
|
||
|
451 -26.22 ATGCTCGCCCTCTTTGGTG
|
||
|
476 -28.00 TCCCTCACTGAGCCCTCCG
|
||
|
480 -28.33 TCACTGAGCCCTCCGCCTC
|
||
|
517 -23.46 GCTAAGATTCAGCTTGGTT
|
||
|
556 -27.27 TCCAGCACTCAGGTTCGGC
|
||
|
602 -27.01 AACTTGAATCCATCGTTGC
|
||
|
648 -28.45 TGCTAAACACAGCCGGTTT
|
||
|
679 -28.18 CTGTTTGCCCAGTTTGGGC
|
||
|
691 -28.51 TTTGGGCCGCTTCTGGACG
|
||
|
713 -27.67 GGCTTGACCGTGGCTGTGG
|
||
|
803 -25.47 ATGCTGACCATGCTTTTCA
|
||
|
848 -28.11 ATAATGTTAAGTTTGATTC
|
||
|
857 -25.97 AGTTTGATTCCGCTGGCCG
|
||
|
879 -27.85 CCGCTGCTGCTGTTTCCAC
|
||
|
917 -27.77 GCGATGAGGAAGGCTTGTT
|
||
|
931 -27.81 TTGTTGGCGCGCCTGCTCG
|
||
|
952 -23.52 GAGGTGACTACCATCCGTG
|
||
|
977 -28.40 TGCGTGGGTGAGCTGTTGT
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
? Menu or option number=6
|
||
|
Page through text files
|
||
|
? Name of file to read=1.WTS
|
||
|
GCN4 SEQUENCES
|
||
|
19 1 -28.534 -18.854
|
||
|
P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
|
||
|
N 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
|
||
|
16
|
||
|
T 0 0 0 0 16 0 0 1 16 0 5 11 10 12 9 6 7 12 6
|
||
|
C 0 0 0 0 0 0 0 15 0 15 0 3 2 2 4 3 2 1 3
|
||
|
A 0 0 0 0 0 0 16 0 0 1 10 0 3 2 0 3 5 2 2
|
||
|
G 0 0 0 0 0 16 0 0 0 0 1 2 1 0 3 4 2 1 5
|
||
|
End of file
|
||
|
|
||
|
@21. TX 3 @ Count base composition
|
||
|
|
||
|
This routine calculates the base composition of the active
|
||
|
region of the sequence as both totals and percentages.
|
||
|
@22. TX 3 @ Count dinucleotide frequencies
|
||
|
|
||
|
This routine simply counts dinucleotide frequencies for the
|
||
|
currently active region of the sequence. It also calculates an
|
||
|
expected distribution based on the base composition. The output
|
||
|
looks like:
|
||
|
T C A G
|
||
|
obs expected obs expected obs expected obs expected
|
||
|
|
||
|
T 8.44 8.25 6.67 7.01 10.35 9.92 3.27 3.54
|
||
|
C 7.49 7.01 6.76 5.95 8.39 8.43 1.76 3.01
|
||
|
A 10.13 9.92 7.78 8.43 11.74 11.93 4.89 4.26
|
||
|
G 2.67 3.54 3.19 3.01 4.06 4.26 2.42 1.52
|
||
|
|
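A minimal sketch of the calculation, with hypothetical names. It
assumes that every overlapping pair of adjacent bases is counted, that
both columns are percentages, and that the expected value for a pair
is the product of the fractions of its two bases. The last assumption
agrees with the table above, where for example the T-C expected value
7.01 is the geometric mean of the T-T and C-C values 8.25 and 5.95.

```python
from collections import Counter

BASES = "TCAG"


def dinucleotide_table(seq):
    """Return {pair: (observed %, expected %)} for all 16 dinucleotides."""
    seq = seq.upper()
    # Every overlapping pair of adjacent characters.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_pairs = sum(pairs[a + b] for a in BASES for b in BASES)
    composition = Counter(seq)
    total_bases = sum(composition[b] for b in BASES)
    table = {}
    for a in BASES:
        for b in BASES:
            observed = 100.0 * pairs[a + b] / total_pairs
            # Expected from base composition alone.
            expected = (100.0 * (composition[a] / total_bases)
                        * (composition[b] / total_bases))
            table[a + b] = (observed, expected)
    return table


if __name__ == "__main__":
    for pair, (obs, exp) in dinucleotide_table("AACCGGTTAACC").items():
        print(f"{pair} {obs:6.2f} {exp:6.2f}")
```

A pair whose observed value is well below its expected value, such as
C-G in the table above (1.76 against 3.01), is under-represented
relative to the base composition.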
||
|
@23. TX 3 5 @ Count codons and amino acids
|
||
|
|
||
|
     This function calculates codon usage, amino acid composition, protein
|
||
|
molecular weights, and base composition. Users select the segments
|
||
|
of the sequence that the program should analyse.
|
||
|
|
||
|
Choose between being shown observed counts or counts
|
||
|
normalised so that the totals for each amino acid sum to 100. Select
|
||
|
to define segments using either the keyboard or an EMBL feature
|
||
|
table. Define the segments to count over. Select strand for each
|
||
|
segment. Stop selecting segments by typing a zero for "Count from
|
||
|
()". The results are displayed a screenful at a time, and the bell
|
||
|
is sounded to show there is more to come. A zero start position, or
|
||
|
the end of an EMBL feature table, signals the routine to print out
|
||
|
totals for all values.
|
||
|
|
||
|
The counts are broken down into several figures. Base
|
||
|
composition by position in codon expressed as a percentage of each
|
||
|
base's own frequency; base composition by position in codon
|
||
|
expressed as a percentage of the overall base composition of the
|
||
|
section; base composition expected for this amino acid composition
|
||
|
if there was no codon preference; percentage deviations of the
|
||
|
observed amino acid composition from an average amino acid
|
||
|
composition.
|
||
|
|
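The first two of these figures can be sketched as below. The names are
hypothetical and the sketch makes some assumptions: segments are given
as 1-based inclusive positions on the + strand, codons are read in
frame from the segment start, and any incomplete codon at the end is
dropped (so a segment of 1000 bases gives 333 codons, as in the
dialogue further on).

```python
from collections import Counter

BASES = "TCAG"


def codon_counts(seq, start, end):
    """Count codons in seq[start..end] (1-based, inclusive), in frame."""
    segment = seq[start - 1:end].upper()
    usable = len(segment) - len(segment) % 3  # drop a trailing partial codon
    return Counter(segment[i:i + 3] for i in range(0, usable, 3))


def position_composition(codons):
    """Return two 3x4 tables of percentages, rows = codon position.

    own:    each base at each position as a % of that base's own total
            (columns sum to 100).
    within: each base at each position as a % of all bases at that
            position (rows sum to 100).
    """
    by_pos = [Counter() for _ in range(3)]
    for codon, n in codons.items():
        for pos, base in enumerate(codon):
            by_pos[pos][base] += n
    total_codons = sum(codons.values())
    base_totals = {b: sum(p[b] for p in by_pos) for b in BASES}
    own = [[100.0 * p[b] / base_totals[b] if base_totals[b] else 0.0
            for b in BASES] for p in by_pos]
    within = [[100.0 * p[b] / total_codons for b in BASES] for p in by_pos]
    return own, within


if __name__ == "__main__":
    counts = codon_counts("ATGAAATTTTAG", 1, 12)
    own, within = position_composition(counts)
    for row in own:
        print(" ".join(f"{value:6.2f}" for value in row))
    for row in within:
        print(" ".join(f"{value:6.2f}" for value in row))
```

In the output shown below, the first block of three rows corresponds
to `own` and the second block to `within`.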
||
|
The output looks like:
|
||
|
|
||
|
===========================================
|
||
|
F TTT 1. S TCT 2. Y TAT 2. C TGT 1.
|
||
|
F TTC 1. S TCC 1. Y TAC 3. C TGC 2.
|
||
|
L TTA 7. S TCA 4. * TAA 9. * TGA 1.
|
||
|
L TTG 2. S TCG 1. * TAG 2. W TGG 2.
|
||
|
===========================================
|
||
|
L CTT 3. P CCT 2. H CAT 4. R CGT 1.
|
||
|
L CTC 2. P CCC 3. H CAC 1. R CGC 0.
|
||
|
L CTA 3. P CCA 2. Q CAA 4. R CGA 0.
|
||
|
L CTG 2. P CCG 2. Q CAG 1. R CGG 2.
|
||
|
===========================================
|
||
|
I ATT 9. T ACT 1. N AAT 7. S AGT 3.
|
||
|
I ATC 2. T ACC 2. N AAC 4. S AGC 2.
|
||
|
I ATA 4. T ACA 5. K AAA 13. R AGA 5.
|
||
|
M ATG 1. T ACG 2. K AAG 4. R AGG 1.
|
||
|
===========================================
|
||
|
V GTT 2. A GCT 2. D GAT 1. G GGT 3.
|
||
|
V GTC 2. A GCC 2. D GAC 1. G GGC 1.
|
||
|
V GTA 4. A GCA 3. E GAA 2. G GGA 1.
|
||
|
V GTG 2. A GCG 0. E GAG 1. G GGG 1.
|
||
|
===========================================
|
||
|
total codons= 166.
|
||
|
T C A G
|
||
|
|
||
|
1 31.06 33.68 34.03 35.00
|
||
|
2 35.61 35.79 30.89 32.50
|
||
|
3 33.33 30.53 35.08 32.50
|
||
|
|
||
|
1 24.70 19.28 39.16 16.87
|
||
|
2 28.31 20.48 35.54 15.66
|
||
|
3 26.51 17.47 40.36 15.66
|
||
|
% 26.51 19.08 38.35 16.06 observed, overall totals
|
||
|
% 25.00 22.26 33.10 19.65 expected, even codons per acid
|
||
|
|
||
|
A C D E F G H I K L
|
||
|
7. 3. 2. 3. 2. 6. 5. 15. 17. 19.
|
||
|
o-e % -47. -33. -76. -68. -64. -54. 62. 116. 67. 67.
|
||
|
|
||
|
M N P Q R S T V W Y
|
||
|
1. 11. 9. 5. 9. 13. 10. 10. 2. 5.
|
||
|
o-e % -62. 66. 12. -17. 19. 21. 6. -2. 0. -5.
|
||
|
total acids= 154. molecular weight= 17421.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=23
|
||
|
Calculate codon usage, base composition
|
||
|
and amino acid composition
|
||
|
? (y/n) (y) Show observed counts
|
||
|
? (y/n) (y) Define segments using keyboard
|
||
|
? Count from (0-1023) (0) =1
|
||
|
? Count to (1-1023) (1023) =1000
|
||
|
? (y/n) (y) + strand
|
||
|
|
||
|
===========================================
|
||
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
||
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
||
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
||
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
||
|
===========================================
|
||
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
||
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
||
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
||
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
||
|
===========================================
|
||
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
||
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
||
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
||
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
||
|
===========================================
|
||
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
||
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
||
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
||
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
||
|
===========================================
|
||
|
|
||
|
|
||
|
Total codons= 333.
|
||
|
T C A G
|
||
|
|
||
|
1 23.32 37.69 28.99 40.06
|
||
|
2 37.15 22.31 38.46 36.59
|
||
|
3 39.53 40.00 32.54 23.34
|
||
|
----- ----- ----- -----
|
||
|
= 100% 100% 100% 100%
|
||
|
|
||
|
1 17.72 29.43 14.71 38.14 = 100%
|
||
|
2 28.23 17.42 19.52 34.83 = 100%
|
||
|
3 30.03 31.23 16.52 22.22 = 100%
|
||
|
% 25.33 26.03 16.92 31.73 Observed, overall totals
|
||
|
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
|
||
|
|
||
|
A C D E F G H I K L
|
||
|
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
|
||
|
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
|
||
|
|
||
|
M N P Q R S T V W Y
|
||
|
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
|
||
|
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
|
||
|
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
|
||
|
|
||
|
|
||
|
? Count from (0-1023) (0) =
|
||
|
|
||
|
Codon totals over all genes
|
||
|
===========================================
|
||
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
||
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
||
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
||
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
||
|
===========================================
|
||
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
||
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
||
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
||
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
||
|
===========================================
|
||
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
||
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
||
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
||
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
||
|
===========================================
|
||
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
||
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
||
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
||
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
||
|
===========================================
|
||
|
|
||
|
|
||
|
Total codons= 333.
|
||
|
T C A G
|
||
|
|
||
|
1 23.32 37.69 28.99 40.06
|
||
|
2 37.15 22.31 38.46 36.59
|
||
|
3 39.53 40.00 32.54 23.34
|
||
|
----- ----- ----- -----
|
||
|
= 100% 100% 100% 100%
|
||
|
|
||
|
1 17.72 29.43 14.71 38.14 = 100%
|
||
|
2 28.23 17.42 19.52 34.83 = 100%
|
||
|
3 30.03 31.23 16.52 22.22 = 100%
|
||
|
% 25.33 26.03 16.92 31.73 Observed, overall totals
|
||
|
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
|
||
|
|
||
|
A C D E F G H I K L
|
||
|
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
|
||
|
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
|
||
|
|
||
|
M N P Q R S T V W Y
|
||
|
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
|
||
|
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
|
||
|
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
|
||
|
|
||
|
@24. TX 3 @ Plot base composition
|
||
|
|
||
|
This option plots the base composition of the sequence. The
|
||
|
counts for any combination of bases can be plotted.
|
||
|
|
||
|
If dialogue is requested the user is presented with a check
|
||
|
box for selecting which bases should be counted, and then allowed to
|
||
|
define a window length, and a "plot interval". Otherwise, the AT
|
||
|
composition is plotted with a window of 101 and a plot interval of
|
||
|
5.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=d24
|
||
|
Plot base composition
|
||
|
|
||
|
checkbox: those set are marked X
|
||
|
X 1 T
|
||
|
2 C
|
||
|
X 3 A
|
||
|
4 G
|
||
|
? 0,1,2,3,4 =1
|
||
|
|
||
|
checkbox: those set are marked X
|
||
|
1 T
|
||
|
2 C
|
||
|
X 3 A
|
||
|
4 G
|
||
|
? 0,1,2,3,4 =3
|
||
|
|
||
|
checkbox: those set are marked X
|
||
|
1 T
|
||
|
2 C
|
||
|
3 A
|
||
|
4 G
|
||
|
? 0,1,2,3,4 =2
|
||
|
|
||
|
checkbox: those set are marked X
|
||
|
1 T
|
||
|
X 2 C
|
||
|
3 A
|
||
|
4 G
|
||
|
? 0,1,2,3,4 =4
|
||
|
|
||
|
checkbox: those set are marked X
|
||
|
1 T
|
||
|
X 2 C
|
||
|
3 A
|
||
|
X 4 G
|
||
|
? 0,1,2,3,4 =
|
||
|
|
||
|
? odd span length (1-201) (31) =
|
||
|
? plot interval (1-11) (5) =
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
|
||
|
|
||
|
@25. TX 3 @ Plot local deviations in base composition
|
||
|
|
||
|
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The composition of every segment of the sequence is compared with a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the
user.
|
||
|
|
||
|
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as the standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots
always exactly fill the available box. Any unusual regions will show
as peaks.
|
||
|
|
||
|
The following measure is used: for each window position
|
||
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
|
||
|
observed composition and exp is the expected composition (the
|
||
|
composition of the standard). The calculation is performed once to
|
||
|
find out the range of values and is then repeated and plotted so
|
||
|
that the plot exactly fills the allocated screen space.
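
A minimal Python sketch of the measure follows. It assumes that the formula is applied term by term, i.e. that each base contributes (obs-exp)*(obs-exp)/(exp*exp), and that compositions are fractions of the window. The names `composition` and `local_deviation` are invented for the example, and bases with an expected value of zero are skipped to avoid division by zero.

```python
from collections import Counter


def composition(segment, alphabet="TCAG"):
    """Fractional composition of a segment over the given alphabet."""
    counts = Counter(segment.upper())
    total = sum(counts[b] for b in alphabet)
    return {b: (counts[b] / total if total else 0.0) for b in alphabet}


def local_deviation(seq, span=31, interval=5, standard=None):
    """Score each window against the standard composition.

    Returns (centre, score) pairs with 1-based centres. The standard
    is the whole sequence unless a standard segment is supplied.
    """
    exp = composition(standard if standard is not None else seq)
    half = span // 2
    scores = []
    for centre in range(half, len(seq) - half, interval):
        obs = composition(seq[centre - half:centre + half + 1])
        score = sum((obs[b] - exp[b]) ** 2 / (exp[b] * exp[b])
                    for b in exp if exp[b] > 0)
        scores.append((centre + 1, score))
    return scores
```

The minimum and maximum of the returned scores give the range that would be used to scale the plot on the second pass.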
|
||
|
@26. TX 3 @ Plot local deviations from dinucleotide composition
|
||
|
|
||
|
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The dinucleotide composition of every segment of the sequence is
compared with a standard composition. The levels of similarity are
plotted as chi squared values. The standard can be the composition
of the whole sequence, or alternatively that of a small segment
defined by the user.
|
||
|
|
||
|
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as the standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots
always exactly fill the available box. Any unusual regions will show
as peaks.
|
||
|
|
||
|
The following measure is used: for each window position
|
||
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
|
||
|
observed composition and exp is the expected composition (the
|
||
|
composition of the standard). The calculation is performed once to
|
||
|
find out the range of values and is then repeated and plotted so
|
||
|
that the plot exactly fills the allocated screen space.
|
||
|
@27. TX 3 @ Plot local deviations from trinucleotide composition
|
||
|
|
||
|
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The trinucleotide composition of every segment of the sequence is
compared with a standard composition. The levels of similarity are
plotted as chi squared values. The standard can be the composition
of the whole sequence, or alternatively that of a small segment
defined by the user.
|
||
|
|
||
|
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as the standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots
always exactly fill the available box. Any unusual regions will show
as peaks.
|
||
|
|
||
|
The following measure is used: for each window position
|
||
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
|
||
|
observed composition and exp is the expected composition (the
|
||
|
composition of the standard). The calculation is performed once to
|
||
|
find out the range of values and is then repeated and plotted so
|
||
|
that the plot exactly fills the allocated screen space.
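
The same measure can be applied to words of any length, which covers both the dinucleotide (k=2) and trinucleotide (k=3) cases. The Python sketch below is an illustration only. It assumes that overlapping words are counted and that words absent from the standard are ignored; neither point is stated in this documentation, and the names `word_composition` and `word_deviation` are invented for the example.

```python
from collections import Counter


def word_composition(segment, k):
    """Fractions of each overlapping word of length k in a segment."""
    segment = segment.upper()
    words = [segment[i:i + k] for i in range(len(segment) - k + 1)]
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()} if total else {}


def word_deviation(window, standard, k=3):
    """Sum of (obs-exp)*(obs-exp)/(exp*exp) over words in the standard."""
    exp = word_composition(standard, k)
    obs = word_composition(window, k)
    return sum((obs.get(w, 0.0) - e) ** 2 / (e * e) for w, e in exp.items())


print(word_composition("AAAT", 2))            # dinucleotide fractions
print(word_deviation("ACGACG", "ACGACG", 3))  # identical segments score 0.0
```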
|
||
|
@28. TX 5 @ Calculate codon constraint
|
||
|
|
||
|
The purpose of this option (which is somewhat specialised) is
|
||
|
to measure the level of constraint imposed on the sequence by coding
|
||
|
for a protein of the observed composition. It measures the strength
|
||
|
of the codon bias averaged over windows of 99 codons and displays
|
||
|
the values observed.
|
||
|
|
||
|
Select between defining segments at the keyboard or using an
|
||
|
EMBL feature table. Finish selecting segments by typing a zero
|
||
|
start. The value for each segment is displayed:
|
||
|
|
||
|
Mean (W-EW) / EWD, window 99 10.5
|
||
|
|
||
|
The codon constraint is the difference between the observed
codon improbability and the mean improbability for a sequence of the
same composition. See McLachlan, Staden and Boswell, Nucl. Acids
Res. 1984.
|
||
|
@29. TX 3 @ Plot negentropy
|
||
|
|
||
|
This routine is designed to show regions of the sequence that
|
||
|
differ in composition from others, and hence is like the "plot
|
||
|
deviation.." routines.
|
||
|
|
||
|
Negentropy or information is defined in the following way: let
|
||
|
Pi be the probability of observing base i, where i = A,C,G or T,
|
||
|
then the average information per base is I=-sum(Pi.Log(Pi)) (sum
|
||
|
over all i). This routine calculates Pi by calculating the overall
|
||
|
composition for the sequence and then plots I for windows of length
|
||
|
defined by the user.
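
One possible reading of this calculation is sketched below in Python. It is an interpretation, not the program's code: Pi is taken from the composition of the whole sequence, and each window is scored by the mean of -log(Pi) over the bases it contains. The natural logarithm is assumed, since the base of the logarithm is not given here, and all function names are invented for the example.

```python
import math
from collections import Counter


def base_probabilities(seq, alphabet="ACGT"):
    """Pi for each base that occurs, from the overall composition."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in alphabet)
    return {b: counts[b] / total for b in alphabet if counts[b]}


def information(probs):
    """I = -sum(Pi * log(Pi)) over all bases i."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def window_information(seq, span=31, interval=5):
    """(centre, score) pairs; score is the mean of -log(Pi) in the window."""
    probs = base_probabilities(seq)
    seq = seq.upper()
    half = span // 2
    out = []
    for centre in range(half, len(seq) - half, interval):
        window = [b for b in seq[centre - half:centre + half + 1] if b in probs]
        if window:
            score = -sum(math.log(probs[b]) for b in window) / len(window)
            out.append((centre + 1, score))
    return out
```

With this reading a window whose composition equals that of the whole sequence scores exactly I, and windows rich in rare bases score higher.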
|
||
|
@30. TX 4 @ Search for hairpin loops
|
||
|
|
||
|
Used to find simple inverted repeats or potential hairpin
loops. The loops are defined by a range of sizes for the loop and a
minimum number of consecutive base pairs in the stem. The results
can be presented graphically or listed. A-T, G-C and G-T basepairs
are counted.
|
||
|
|
||
|
Define the range of loop sizes and the minimum number of
|
||
|
consecutive basepairs required. Choose between plotted or listed
|
||
|
results.
|
||
|
|
||
|
The loops found are plotted as blips on a horizontal line that
|
||
|
represents the sequence; the heights of the lines are proportional
|
||
|
to the number of basepairs in the stems. Note that only
|
||
|
uninterrupted stems are found - i.e. all basepairs must be made. To
|
||
|
look for stems with some unpaired bases (or for palindromes) use the
|
||
|
inverted repeat motif class in the pattern searching option.
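
The search described above can be sketched in Python as follows. This is a minimal illustration, not the program's algorithm: for every loop size in the range it extends an uninterrupted stem outwards from the loop, counting A-T, G-C and G-T pairs, and reports stems that reach the minimum length. The name `find_hairpins` and the form of the result are assumptions made for the example.

```python
# Base pairs that are allowed in a stem, in both orientations.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def find_hairpins(seq, min_loop=1, max_loop=3, min_stem=6):
    """Return (stem_start, loop_size, stem_length) tuples.

    stem_start is the 1-based position of the first base of the
    5' arm of the stem.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for loop in range(min_loop, max_loop + 1):
        # left is the last base of the 5' arm, right the first of the 3' arm
        for left in range(n - loop - 1):
            right = left + loop + 1
            length = 0
            while (left - length >= 0 and right + length < n
                   and (seq[left - length], seq[right + length]) in PAIRS):
                length += 1
            if length >= min_stem:
                hits.append((left - length + 2, loop, length))
    return hits


print(find_hairpins("GGGGGGAAACCCCCC"))
```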
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=30
|
||
|
Search for hairpin loops
|
||
|
Define the range of loop sizes
|
||
|
? Minimum loop size (1-30) (1) =
|
||
|
? Maximum loop size (3-60) (3) =
|
||
|
? Minimum number of basepairs (2-20) (6) =
|
||
|
? (y/n) (y) Plot results n
|
||
|
Searching
|
||
|
|
||
|
T.G
|
||
|
G-C
|
||
|
G.T
|
||
|
T.G
|
||
|
C-G
|
||
|
G-C
|
||
|
T.G
|
||
|
C-G
|
||
|
G.T
|
||
|
GCCGCA GCGGAGG
|
||
|
49
|
||
|
|
||
|
G
|
||
|
G-C
|
||
|
T.G
|
||
|
C-G
|
||
|
G.T
|
||
|
T.G
|
||
|
G-C
|
||
|
CTGCTG GGAGGTC
|
||
|
56
|
||
|
|
||
|
|
||
|
G
|
||
|
T.G
|
||
|
G-C
|
||
|
G.T
|
||
|
T.G
|
||
|
C-G
|
||
|
G-C
|
||
|
T-A
|
||
|
T.G
|
||
|
AGCGCA CGACTGA
|
||
|
139
|
||
|
|
||
|
A C
|
||
|
G.T
|
||
|
C-G
|
||
|
G.T
|
||
|
C-G
|
||
|
C-G
|
||
|
G-C
|
||
|
TTCGCT CAACGCC
|
||
|
244
|
||
|
|
||
|
@31. TX 4 @ Search for long range inverted repeats
|
||
|
|
||
|
Searches for inverted repeats. The repeats found are exact
|
||
|
matches of at least 6 consecutive bases. Results can be presented
|
||
|
graphically or listed. Plotted results show the end points of
|
||
|
repeats joined by rectangular lines.
|
||
|
|
||
|
If dialogue is not requested the defaults will be taken.
|
||
|
Otherwise choose between plotted or listed results. If required
|
||
|
select to analyse a restricted segment of the currently active
|
||
|
region. Choose a repeat length.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=D31
|
||
|
Plot long-range inverted repeats
|
||
|
? (y/n) (y) Plot results n
|
||
|
Define restricted region
|
||
|
? start (1-1023) (1) =
|
||
|
? end (2-1023) (1023) =
|
||
|
? Minimum inverted repeat (6-30) (12) =10
|
||
|
Searching
|
||
|
27 909 10 TGCCCAGAGA
|
||
|
|
||
|
@32. TX 4 @ Search for repeats
|
||
|
|
||
|
Searches for direct repeats. The repeats found are exact
|
||
|
matches of at least 6 consecutive bases. Results can be presented
|
||
|
graphically or listed. Plotted results show the end points of
|
||
|
repeats joined by rectangular lines.
|
||
|
|
||
|
If dialogue is not requested the defaults will be taken.
|
||
|
Otherwise choose between plotted or listed results. If required
|
||
|
select to analyse a restricted segment of the currently active
|
||
|
region. Choose a repeat length.
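
A simple way to find exact direct repeats is sketched below in Python. It is an illustration under stated assumptions, not the program's method: all words of the minimum length are indexed, each matching pair is extended to the right, and a pair that merely continues a longer repeat starting one base earlier is skipped. The program's own listing does report such overlapping matches, so this filtering is a choice made for the example, as is the name `direct_repeats`.

```python
from collections import defaultdict


def direct_repeats(seq, min_len=12):
    """Return (pos1, pos2, length, word) for exact direct repeats.

    Positions are 1-based starts of the two copies.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    hits = []
    for positions in index.values():
        for a_idx, a in enumerate(positions):
            for b in positions[a_idx + 1:]:
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue  # part of a longer repeat starting earlier
                length = min_len
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                hits.append((a + 1, b + 1, length, seq[a:a + length]))
    return sorted(hits)


print(direct_repeats("GATTACACCGATTACA", min_len=6))
```

Each tuple has the same layout as a line of the listing: the two positions, the length and the repeated sequence.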
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=D32
|
||
|
Plot repeats
|
||
|
? (y/n) (y) Plot results n
|
||
|
Define restricted region
|
||
|
? start (1-1023) (1) =
|
||
|
? end (2-1023) (1023) =
|
||
|
? Minimum repeat (6-30) (12) =8
|
||
|
Searching
|
||
|
619 988 8 GCTGTTGT
|
||
|
514 646 8 GCTGCTAA
|
||
|
94 865 8 TCCGCTGG
|
||
|
146 222 9 GTGGCTGGC
|
||
|
455 497 8 TCGCCCTC
|
||
|
454 496 9 CTCGCCCTC
|
||
|
872 875 8 GCCGCCGC
|
||
|
510 615 8 CGTTGCTG
|
||
|
152 913 8 GGCAGCGA
|
||
|
199 265 8 CGTCGAGG
|
||
|
689 794 8 AGTTTGGG
|
||
|
147 223 8 TGGCTGGC
|
||
|
101 116 8 GACGAGGA
|
||
|
8 690 8 GTTTGGGC
|
||
|
52 141 8 TGCTGGTG
|
||
|
|
||
|
@33. TX 4 @ Search for z dna (total ry, yr)
|
||
|
|
||
|
Searches for segments of the sequence that might form Z DNA. A
|
||
|
window length is chosen and the number of RY and YR dinucleotides
|
||
|
within each window is plotted. The top of the box corresponds to all
|
||
|
RY or YR, the bottom to zero RY or YR.
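
The count for a single window can be sketched in Python as below. The sketch assumes that overlapping dinucleotides are counted, so that a fully alternating window of length n contains n-1 of them; this documentation does not say whether the program counts them this way. The function name is invented for the example.

```python
PURINES = set("AG")       # R
PYRIMIDINES = set("CT")   # Y


def alternating_dinucleotides(window):
    """Count adjacent base pairs in the window that are RY or YR."""
    window = window.upper()
    count = 0
    for a, b in zip(window, window[1:]):
        ry = a in PURINES and b in PYRIMIDINES
        yr = a in PYRIMIDINES and b in PURINES
        if ry or yr:
            count += 1
    return count


print(alternating_dinucleotides("GCGCGC"))
print(alternating_dinucleotides("GCAA"))
```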
|
||
|
|
||
|
If dialogue is requested, select a window length and plot
|
||
|
interval. Otherwise the defaults will be used.
|
||
|
|
||
|
The program contains three separate ways of doing this
|
||
|
(options 33,34,35).
|
||
|
@34. TX 4 @ Search for z dna (runs of ry, yr)
|
||
|
|
||
|
Searches for segments of the sequence that might form Z DNA.
|
||
|
Results are plotted.
|
||
|
|
||
|
If dialogue is requested define a window length and plot
|
||
|
interval. Otherwise the defaults will be used. The routine counts
|
||
|
the number of R in positions 1,3,5 etc =R1, the number of Y in
|
||
|
positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2
|
||
|
and the number of R in positions 2,4,6 etc =R2 for a window length.
|
||
|
It plots the maximum of R1+Y1 and R2+Y2 relative to a minimum of
|
||
|
(window length)/2 and a maximum of (window length). (see 33,35).
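
The four counts and the plotted value can be written out directly. The Python sketch below follows the definitions of R1, Y1, R2 and Y2 given above for one window; the function name `phased_score` is an assumption made for the example.

```python
def phased_score(window):
    """Return max(R1 + Y1, R2 + Y2) for one window."""
    r1 = y1 = r2 = y2 = 0
    for i, base in enumerate(window.upper()):
        odd_position = (i % 2 == 0)  # positions 1, 3, 5, ... counting from 1
        if base in "AG":             # purine, R
            if odd_position:
                r1 += 1
            else:
                r2 += 1
        elif base in "CT":           # pyrimidine, Y
            if odd_position:
                y2 += 1
            else:
                y1 += 1
    return max(r1 + y1, r2 + y2)


print(phased_score("GCGCGC"))  # fully alternating: the window length
print(phased_score("AAAA"))    # no alternation: half the window length
```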
|
||
|
@35. TX 4 @ Search for z dna (best phased value)
|
||
|
|
||
|
Searches for segments of the sequence that might form Z DNA.
|
||
|
Results are plotted.
|
||
|
|
||
|
If dialogue is requested define a window length and a plot
interval. Otherwise the default values will be used.
|
||
|
|
||
|
The routine counts the number of consecutive RY or YR
|
||
|
dinucleotides in phase. It moves through the sequence counting the
|
||
|
number of RY or YR dinucleotides; when the next dinucleotide is not
|
||
|
of the correct type the score is set back to zero and the search
|
||
|
restarted using the current base to set the phase. The plots are
|
||
|
done relative to a minimum of zero and a maximum defined by the
|
||
|
user. (See 33,34).
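
One way to read the counting rule is sketched below in Python. It is an interpretation only: the sequence is read two bases at a time, a run continues while each dinucleotide is of the same type (RY or YR) as the one that started it, and on a failure the score is reset and the search restarts at the current base. The names `kind` and `phased_runs` and the form of the result are assumptions made for the example.

```python
def kind(a, b):
    """Classify a dinucleotide as 'RY', 'YR' or None."""
    if a in "AG" and b in "CT":
        return "RY"
    if a in "CT" and b in "AG":
        return "YR"
    return None


def phased_runs(seq):
    """Return (start, count) for each run of in-phase RY or YR dinucleotides.

    start is the 1-based position of the first base of the run.
    """
    seq = seq.upper()
    runs = []
    i = 0
    start, score, phase = 0, 0, None
    while i + 1 < len(seq):
        k = kind(seq[i], seq[i + 1])
        if k is not None and (score == 0 or k == phase):
            if score == 0:
                start, phase = i, k
            score += 1
            i += 2
        else:
            if score > 0:
                runs.append((start + 1, score))
            else:
                i += 1  # nothing to reset, move to the next base
            score = 0   # restart from the current base
    if score > 0:
        runs.append((start + 1, score))
    return runs


print(phased_runs("GCGCGCAA"))
```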
|
||
|
@36. TX 4 @ Local similarity or complementarity search
|
||
|
|
||
|
This function is designed to find segments of local similarity
|
||
|
or complementarity. It is therefore like performing a DIAGON plot
|
||
|
that is restricted to regions near the main diagonal. Results can
|
||
|
be presented graphically or listed.
|
||
|
|
||
|
Users define a region to search through, a span length, a
|
||
|
range for searching through and a cut-off score. The program takes
|
||
|
all sections of sequence of length span within the defined region
|
||
|
and compares them to all other sequences within the region and range
|
||
|
specified. If a match above the cutoff is found we need to show the
|
||
|
position of the two sections of sequence and the score, and we do it
|
||
|
in the following way. If we have a 70% match between a sequence
|
||
|
that starts at p1 and a sequence that starts at p2 the program draws
|
||
|
a diagonal line that starts at p1 with height 70% of the box and
|
||
|
which finishes at p2 with height 0. The matches can also be listed.
|
||
|
|
||
|
Here I define the terms range, region, and span and what is
|
||
|
compared. Suppose we have a defined region j1 to j2, a range of i1
|
||
|
to i2 and a span of s; the program will take, in turn, all sections
|
||
|
of sequence of length s within j1 and j2 and compare them to all
|
||
|
sequences that start a distance i1+s-1 to i2+s-1 away from them.
|
||
|
First it will take the sequence of length s starting at j1 and
|
||
|
compare it with the sequence of length s starting at j1+s-1+i1, then
|
||
|
j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the sequence of
|
||
|
length s starting at j1+1 and compare it with the sequence starting
|
||
|
at j1+s-1+1+i1 etc. This continues until we hit the right hand end
|
||
|
of the sequence as defined by j2. Note 1) that sequences are not
|
||
|
compared with themselves: the nearest sequence compared to a span s
|
||
|
starting at j starts at j+s; 2) ranges i1 and i2 are ranges of start
|
||
|
positions; 3) by choosing a range greater than the length of the
|
||
|
sequence this routine will do a full DIAGON analysis except for
|
||
|
those points within a distance span of the main diagonal (see note
|
||
|
1).
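
The comparison scheme defined above can be sketched in Python as follows. Only the direct (similarity) comparison is shown, the score is the percentage of identical bases, and the second section is required to lie inside the region; the last two points are assumptions made for the example, as are the function and parameter names.

```python
def local_similarity(seq, span=15, region=None, i1=1, i2=5, cutoff=70.0):
    """Return (p1, p2, percent) for pairs of sections scoring >= cutoff.

    p1 and p2 are 1-based start positions. A section starting at p1 is
    compared with sections starting at p1+span-1+i1 ... p1+span-1+i2.
    """
    seq = seq.upper()
    j1, j2 = region if region else (1, len(seq))
    hits = []
    for p1 in range(j1, j2 - span + 2):
        for offset in range(i1, i2 + 1):
            p2 = p1 + span - 1 + offset
            if p2 + span - 1 > j2:
                break  # second section would run past the region
            a = seq[p1 - 1:p1 - 1 + span]
            b = seq[p2 - 1:p2 - 1 + span]
            matches = sum(x == y for x, y in zip(a, b))
            percent = 100.0 * matches / span
            if percent >= cutoff:
                hits.append((p1, p2, percent))
    return hits


print(local_similarity("ACGTTACGT", span=4, i1=1, i2=2, cutoff=100.0))
```

With a span of 15, a section starting at 100 and a range of 1 to 5, the second section starts between 115 and 119.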
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=36
|
||
|
Search for local similarity or complementarity
|
||
|
? (y/n) (y) Find direct repeats
|
||
|
? (y/n) (y) Keep picture n
|
||
|
? Span (5-200) (15) =
|
||
|
Define restricted region
|
||
|
? start (0-1023) (1) =
|
||
|
? end (2-1023) (1023) =
|
||
|
? Percent match (1.00-100.00) (70.00) =
|
||
|
? Range start (1-50) (1) =
|
||
|
? Range end (1-50) (1) =5
|
||
|
? (y/n) (y) Plot results n
|
||
|
Working
|
||
|
|
||
|
|
||
|
118 128
|
||
|
CGAGGAGGAG GTGGA
|
||
|
** ***** ** **
|
||
|
GGACGAGGAC GTCGA
|
||
|
100 110
|
||
|
|
||
|
|
||
|
119 129
|
||
|
GAGGAGGAGG TGGAT
|
||
|
** ***** * * **
|
||
|
GACGAGGACG TCGAC
|
||
|
101 111
|
||
|
? (y/n) (y) Find direct repeats n
|
||
|
? (y/n) (y) Keep picture
|
||
|
? Span (5-200) (15) =
|
||
|
Define restricted region
|
||
|
? start (0-1023) (1) =
|
||
|
? end (2-1023) (1023) =
|
||
|
? Percent match (1.00-100.00) (70.00) =
|
||
|
? Range start (1-50) (1) =
|
||
|
? Range end (1-50) (5) =8
|
||
|
? (y/n) (y) List results
|
||
|
|
||
|
Working
|
||
|
|
||
|
|
||
|
178 188
|
||
|
ACTCAGATCC GGCGG
|
||
|
***** *** * **
|
||
|
ACTCAAATCA GTCGC
|
||
|
156 166
|
||
|
|
||
|
|
||
|
177 187
|
||
|
CACTCAGATC CGGCG
|
||
|
***** *** * **
|
||
|
AACTCAAATC AGTCG
|
||
|
157 167
|
||
|
? (y/n) (y) Find inverted repeats !
|
||
|
@37. TX 5 @ Set genetic code
|
||
|
|
||
|
This function allows the user to change the current active
|
||
|
genetic code for all the options. The user may select: the standard
|
||
|
code, the mammalian mitochondrial code, the yeast mitochondrial code
|
||
|
or a personal code (define your own).
|
||
|
|
||
|
Select code. If personal, define a codon and select an amino
acid. When all codons have been reset, define a blank codon.
|
||
|
|
||
|
The code differences are:
|
||
|
Mammalian Yeast
|
||
|
Codon Mitochondrial Mitochondrial Standard
|
||
|
UGA W W STOP
|
||
|
AUA M M I
|
||
|
CUA L T L
|
||
|
AGA STOP R R
|
||
|
AGG STOP R R
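
The differences listed above can be represented as small overrides applied to the standard code. The Python sketch below is an illustration only; codons are written with T rather than U because they are looked up in DNA, `*` stands for STOP, and the names `STANDARD`, `DIFFERENCES` and `genetic_code` are invented for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AA = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {"".join(c): aa
            for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Only the codons that differ from the standard code.
DIFFERENCES = {
    "mammalian mitochondrial": {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"},
    "yeast mitochondrial": {"TGA": "W", "ATA": "M", "CTA": "T"},
}


def genetic_code(name="standard", personal=None):
    """Return a codon -> amino acid table for the chosen code.

    `personal` is an optional dict of user-defined codon assignments
    applied last, like the personal code option.
    """
    code = dict(STANDARD)
    code.update(DIFFERENCES.get(name, {}))
    code.update(personal or {})
    return code


print(genetic_code("yeast mitochondrial")["CTA"])
print(genetic_code(personal={"TTT": "W"})["TTT"])
```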
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=37
|
||
|
X 1 Standard code
|
||
|
2 Mammalian mitochondrial code
|
||
|
3 Yeast mitochondrial code
|
||
|
4 Personal code
|
||
|
? 0,1,2,3,4 =2
|
||
|
|
||
|
? Menu or option number=37
|
||
|
X 1 Standard code
|
||
|
2 Mammalian mitochondrial code
|
||
|
3 Yeast mitochondrial code
|
||
|
4 Personal code
|
||
|
? 0,1,2,3,4 =4
|
||
|
Define genetic code by typing a codon
|
||
|
followed by a 1 letter amino acid symbol
|
||
|
? Codon=TTT
|
||
|
Default Amino acid symbol=F
|
||
|
? Amino acid symbol=W
|
||
|
? Codon=
|
||
|
@38. T 3 4 @ Examine repeats
|
||
|
|
||
|
This function can be used to examine the frequencies of
|
||
|
repeated words within a sequence. It finds all words that occur more
|
||
|
than once. The user selects a minimum word length and the program
|
||
|
finds all words of that length that occur more than once; then it
|
||
|
"follows" each repeated word until it becomes unique. For each word
|
||
|
length it can report the number of different repeated words, the
|
||
|
number of occurrences of each word, and their actual positions and
|
||
|
sequences.
|
||
|
|
||
|
It is possible that the algorithm may run out of memory,
|
||
|
particularly if a short minimum word length is chosen, or if the
|
||
|
sequence is very long or very repetitive. If this occurs the longest
|
||
|
reported word length will not necessarily be the longest in the
|
||
|
sequence: the memory will have been consumed before the longest word
|
||
|
is found.
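
The kind of result this function reports can be reproduced on a small scale with the Python sketch below. It is a brute-force illustration, not the program's algorithm: instead of following each repeated word until it becomes unique, it simply recounts the words at each successive length. The function names are assumptions made for the example.

```python
from collections import defaultdict


def repeated_words(seq, length):
    """Map each word of the given length that occurs more than once
    to the list of its 1-based start positions."""
    seq = seq.lower()
    positions = defaultdict(list)
    for i in range(len(seq) - length + 1):
        positions[seq[i:i + length]].append(i + 1)
    return {w: p for w, p in positions.items() if len(p) > 1}


def repeat_frequencies(seq, min_length):
    """Number of different repeated words for each word length, from
    min_length up to the longest length that still has a repeat."""
    table = {}
    length = min_length
    while True:
        found = repeated_words(seq, length)
        if not found:
            return table
        table[length] = len(found)
        length += 1


print(repeat_frequencies("gattacaccgattaca", 6))
print(repeated_words("gattacaccgattaca", 7))
```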
|
||
|
Typical dialogue and output are shown below.
|
||
|
|
||
|
Expected length of longest repeat 14
|
||
|
? Minumim word length (1-6) (6) =6
|
||
|
Working
|
||
|
? Show repeat frequencies for words of at least length (6-15) (15) =10
|
||
|
For length 10 the number of different repeated words is 2035
|
||
|
For length 11 the number of different repeated words is 613
|
||
|
For length 12 the number of different repeated words is 161
|
||
|
For length 13 the number of different repeated words is 37
|
||
|
For length 14 the number of different repeated words is 10
|
||
|
For length 15 the number of different repeated words is 1
|
||
|
? Show repeats for words of length (6-15) (15) =14
|
||
|
? Show repeats for words occuring with frequency (2-9999) (2) =2
|
||
|
|
||
|
ggtgctcatgccca
|
||
|
occurs at 21611
|
||
|
occurs at 21851
|
||
|
ttatccggtgatga
|
||
|
occurs at 4604
|
||
|
occurs at 8806
|
||
|
agcaccacgctgac
|
||
|
occurs at 5954
|
||
|
occurs at 9486
|
||
|
catgacggaggatg
|
||
|
occurs at 10480
|
||
|
occurs at 19925
|
||
|
aaagacgggaaaat
|
||
|
occurs at 11820
|
||
|
occurs at 43157
|
||
|
tacaaaaccaattt
|
||
|
occurs at 26797
|
||
|
occurs at 31369
|
||
|
cgagaaagagtgcg
|
||
|
occurs at 4260
|
||
|
occurs at 44305
|
||
|
gccggatgatggcg
|
||
|
occurs at 7893
|
||
|
occurs at 16638
|
||
|
atgacggaggatga
|
||
|
occurs at 10481
|
||
|
occurs at 19926
|
||
|
gcggcgaacgaggc
|
||
|
occurs at 11352
|
||
|
occurs at 18718
|
||
|
? Show repeats for words of length (6-15) (15) =!
|
||
|
|
||
|
Example of not enough memory
|
||
|
----------------------------
|
||
|
|
||
|
Expected length of longest repeat 14
|
||
|
? Minumim word length (1-6) (6) =1
|
||
|
Working
|
||
|
Not enough memory
|
||
|
Memory used in bytes 1125996. Length of longest repeat 5
|
||
|
? Show repeat frequencies for words of at least length (1-5) (5) =!
|
||
|
|
||
|
@39. TX 5 @ Translate and list in up to six phases
|
||
|
|
||
|
This is a general listing function that will perform
|
||
|
translations and produce several forms of output. The possibilities
|
||
|
are:
|
||
|
1) no translation, list one or two strands, two ways of numbering the
|
||
|
sequence.
|
||
|
2) translation, one or two strands, one or three letter codes.
|
||
|
Positions defined by:
|
||
|
a) open reading frames of some minimum length l, l can be 0, hence giving
|
||
|
a complete six phase translation.
|
||
|
b) positions typed on keyboard, again 1 to 6 phases, translations appearing
|
||
|
above and below the dna.
|
||
|
c) positions read from a feature table.
|
||
|
|
||
|
It should be used in preference to option 5. For publication
|
||
|
without a translation, the option to number ends of lines is more compact
|
||
|
than option 5. Some examples and typical dialogue are given below. Note the
|
||
|
requirement for d39.
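
A complete six phase translation can be sketched in a few lines of Python. This is an illustration using the standard genetic code and one letter symbols; the names `translate` and `six_phase` are assumptions made for the example, and incomplete codons at the end of a phase are dropped.

```python
from itertools import product

BASES = "TCAG"
# Standard code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the first base; unknown codons give X."""
    dna = dna.upper()
    return "".join(CODE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_phase(dna):
    """Translations of the three phases of each strand."""
    dna = dna.upper()
    minus = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    return {"+": [translate(dna[p:]) for p in range(3)],
            "-": [translate(minus[p:]) for p in range(3)]}


seq = "AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT"
print(six_phase(seq)["+"][0])
```

The sequence used here is the first line of the listings below, so the first phase of the + strand can be compared with the translation shown above that line.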
|
||
|
|
||
|
? Menu or option number=D39
|
||
|
Find open reading frames, translate and list
|
||
|
? (y/n) (y) Show translation
|
||
|
|
||
|
The segments to translate can be
|
||
|
1 Typed on the keyboard
|
||
|
2 Read from a feature table
|
||
|
X 3 Open reading frames
|
||
|
? 1,2,3 =
|
||
|
? Minimum open frame in amino acids (0-7238) (30) =
|
||
|
? (y/n) (y) Use 1 letter codes
|
||
|
Define section of DNA to display
|
||
|
? start (1-7238) (1) =
|
||
|
? end (2-7238) (7238) =300
|
||
|
? Line length (30-120) (60) =
|
||
|
Which strands should be shown
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 1,2,3 =3
|
||
|
? (y/n) (y) Number ends of lines
|
||
|
|
||
|
|
||
|
N A T T I S R I D A T F S A R A P N E N
|
||
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
||
|
. : . : . : . : . : . :
|
||
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
||
|
* S A G W I F I
|
||
|
A V V I L L I S A V K E A R A G F S F
|
||
|
|
||
|
I A K Q V I D H L R N V S N G Q T K S T
|
||
|
L N R L L T I C E M Y L M V K L N L L
|
||
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
||
|
. : . : . : . : . : . :
|
||
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
||
|
Y S F L N N V M Q S I Y R I T L S F R S
|
||
|
I A L C T I S W K R F T D L P * V L D V
|
||
|
|
||
|
R S Q N W E S T V T W N E T S R H R T L
|
||
|
V R R I G N Q L L H G M K L P D T V L *
|
||
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
||
|
. : . : . : . : . : . :
|
||
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
||
|
T R L I P F
|
||
|
R E C F Q S D V T V H F S V E L C R V K
|
||
|
|
||
|
V A Y L K H V E L Q H Q I Q Q L S S K P
|
||
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
||
|
. : . : . : . : . : . :
|
||
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
||
|
T A Y K F C T S S C C W I
|
||
|
|
||
|
S A K M T S Y Q K E Q L K V L S N P D L
|
||
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
||
|
. : . : . : . : . : . :
|
||
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
||
|
|
||
|
|
||
|
? Menu or option number=D39
|
||
|
Find open reading frames, translate and list
|
||
|
? (y/n) (y) Show translation N
|
||
|
Define section of DNA to display
|
||
|
? start (1-7238) (1) =
|
||
|
? end (2-7238) (7238) =300
|
||
|
? Line length (30-120) (60) =
|
||
|
Which strands should be shown
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 1,2,3 =
|
||
|
? (y/n) (y) Number ends of lines
|
||
|
|
||
|
|
||
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
||
|
|
||
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
||
|
|
||
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
||
|
|
||
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
||
|
|
||
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
||
|
|
||
|
|
||
|
? Menu or option number=D39
|
||
|
Find open reading frames, translate and list
|
||
|
? (y/n) (y) Show translation
|
||
|
The segments to translate can be
|
||
|
1 Typed on the keyboard
|
||
|
2 Read from a feature table
|
||
|
X 3 Open reading frames
|
||
|
? 1,2,3 =
|
||
|
? Minimum open frame in amino acids (0-7238) (30) =0
|
||
|
? (y/n) (y) Use 1 letter codes N
|
||
|
Define section of DNA to display
|
||
|
? start (1-7238) (1) =
|
||
|
? end (2-7238) (7238) =300
|
||
|
? Line length (30-120) (60) =
|
||
|
Which strands should be shown
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 1,2,3 =3
|
||
|
? (y/n) (y) Number ends of lines
|
||
|
|
||
|
|
||
|
AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
|
||
|
ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
|
||
|
ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
|
||
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
||
|
. : . : . : . : . : . :
|
||
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
||
|
ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
|
||
|
Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
|
||
|
AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe
|
||
|
|
||
|
IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
|
||
|
***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
|
||
|
TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
|
||
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
||
|
. : . : . : . : . : . :
|
||
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
||
|
TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
|
||
|
Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
|
||
|
IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal
|
||
|
|
||
|
ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
|
||
|
ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
|
||
|
SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
|
||
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
||
|
. : . : . : . : . : . :
|
||
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
||
|
ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
|
||
|
AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
|
||
|
ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys
|
||
|
|
||
|
ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
|
||
|
LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
|
||
|
SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
|
||
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
||
|
. : . : . : . : . : . :
|
||
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
||
|
AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
|
||
|
GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
|
||
|
ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly
|
||
|
|
||
|
SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
|
||
|
ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
|
||
|
IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
|
||
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
||
|
. : . : . : . : . : . :
|
||
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
||
|
GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
|
||
|
ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
|
||
|
AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg
|
||
|
|
||
|
|
||
|
? Menu or option number=D39
|
||
|
Find open reading frames, translate and list
|
||
|
? (y/n) (y) Show translation
|
||
|
The segments to translate can be
|
||
|
1 Typed on the keyboard
|
||
|
2 Read from a feature table
|
||
|
X 3 Open reading frames
|
||
|
? 1,2,3 =1
|
||
|
? (y/n) (y) Use 1 letter codes
|
||
|
Define section of DNA to display
|
||
|
? start (1-7238) (1) =
|
||
|
? end (2-7238) (7238) =300
|
||
|
? Line length (30-120) (60) =
|
||
|
Which strands should be shown
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 1,2,3 =
|
||
|
? (y/n) (y) Number ends of lines N
|
||
|
Translate
|
||
|
? From (0-300) (0) =101
|
||
|
? To (1-300) (300) =300
|
||
|
Translate
|
||
|
? From (0-300) (0) =102
|
||
|
? To (1-300) (300) =200
|
||
|
Translate
|
||
|
? From (0-300) (0) =
|
||
|
|
||
|
|
||
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
|
||
|
10 20 30 40 50 60
|
||
|
|
||
|
M V K L N L L
|
||
|
W S N * I Y
|
||
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
|
||
|
70 80 90 100 110 120
|
||
|
|
||
|
V R R I G N Q L L H G M K L P D T V L *
|
||
|
S F A E L G I N C Y M E * N F Q T P Y F
|
||
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
|
||
|
130 140 150 160 170 180
|
||
|
|
||
|
L H I * N M L S Y S T R F S N * A L S H
|
||
|
S C I F K T C
|
||
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
|
||
|
190 200 210 220 230 240
|
||
|
|
||
|
P Q K * P L I K R S N * R Y S L I L T C
|
||
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
|
||
|
250 260 270 280 290 300
|
||
|
|
||
|
|
||
|
? Menu or option number=D39
|
||
|
Find open reading frames, translate and list
|
||
|
? (y/n) (y) Show translation
|
||
|
The segments to translate can be
|
||
|
1 Typed on the keyboard
|
||
|
2 Read from a feature table
|
||
|
X 3 Open reading frames
|
||
|
? 1,2,3 =2
|
||
|
? Embl feature table file=1.FT
|
||
|
? (y/n) (y) Use 1 letter codes
|
||
|
Define section of DNA to display
|
||
|
? start (1-7238) (1) =
|
||
|
? end (2-7238) (7238) =300
|
||
|
? Line length (30-120) (60) =
|
||
|
Which strands should be shown
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 1,2,3 =3
|
||
|
? (y/n) (y) Number ends of lines
|
||
|
|
||
|
|
||
|
N A T T I S R I D A T F S A R A P N E N
|
||
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
||
|
. : . : . : . : . : . :
|
||
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
||
|
* S A G W I F I
|
||
|
A V V I L L I S A V K E A R A G F S F
|
||
|
|
||
|
I A K Q V I D H L R N V S N G Q T K S T
|
||
|
L N R L L T I C E M Y L M V K L N L L
|
||
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
||
|
. : . : . : . : . : . :
|
||
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
||
|
Y S F L N N V M Q S I Y R I T L S F R S
|
||
|
I A L C T I S W K R F T D L P * V L D V
|
||
|
|
||
|
R S Q N W E S T V T W N E T S R H R T L
|
||
|
V R R I G N Q L L H G M K L P D T V L *
|
||
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
||
|
. : . : . : . : . : . :
|
||
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
||
|
T R L I P F
|
||
|
R E C F Q S D V T V H F S V E L C R V K
|
||
|
|
||
|
V A Y L K H V E L Q H Q I Q Q L S S K P
|
||
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
||
|
. : . : . : . : . : . :
|
||
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
||
|
T A Y K F C T S S C C W I
|
||
|
|
||
|
S A K M T S Y Q K E Q L K V L S N P D L
|
||
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
||
|
. : . : . : . : . : . :
|
||
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
||
|
* L Y E R I R V Q
|
||
|
* F S C N F T S E L G S R
|
||
|
@40. TX 5 @ Translate and write the protein sequence to disk
|
||
|
|
||
|
This routine allows the user to translate sections of the
|
||
|
sequence into the 1 letter amino acid codes and store the resulting
|
||
|
amino acid sequences in a disk file. Two modes of use are possible.
|
||
|
Either all open reading frames of at least some minimum length will
|
||
|
automatically be found and translated, or the user can specify that
|
||
|
particular segments be translated.
|
||
|
|
||
|
Mode 1: the user selects to translate all open reading
|
||
|
frames.
|
||
|
|
||
|
Either, or both, strands can be translated. The output file
|
||
|
is in the same format as a PIR .seq file. Each protein segment is
|
||
|
given an entry name that is its start base in the DNA, and a title
|
||
|
that includes its end position, reading frame and strand (+ for
|
||
|
plus, - for minus). Each segment is terminated by * whether or not
|
||
|
there is a stop codon in the DNA. The file is therefore suitable for
|
||
|
input to FASTA, ALIGNL and ANALYSEPL.
|
||
|
|
||
|
Mode 2: the user selects to identify the segments to
|
||
|
translate.
|
||
|
|
||
|
Either, or both, strands can be translated. If multiple
|
||
|
coding regions are translated each will be separated from the
|
||
|
previous one by a gap of 5 dashes (-----). The sections to
|
||
|
translate can be defined from the keyboard or by supplying the name
|
||
|
of the appropriate EMBL library feature table.
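The automatic mode can be sketched in Python as follows. This is only an illustration, not the program's own code: the function names are hypothetical, only the + strand is handled, and an open reading frame is taken to be any stretch between stop codons (as the example output further on suggests), with the stop codon included in the end position. The title line layout is simplified.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def open_frames(dna, min_aa=30):
    """Yield (start, end, frame, protein) for the + strand, 1-based positions.

    A segment runs from just after one stop codon up to and including the
    next stop codon, or to the end of the sequence if no stop follows.
    """
    for frame in range(3):
        protein = translate(dna[frame:])
        first = 0  # index of the first codon of the current open stretch
        for i, aa in enumerate(protein):
            at_end = i == len(protein) - 1
            if aa == "*" or at_end:
                n_aa = i - first + (0 if aa == "*" else 1)
                if n_aa >= min_aa:
                    segment = protein[first:i + 1]
                    if not segment.endswith("*"):
                        segment += "*"  # always terminate with '*'
                    yield frame + 3 * first + 1, frame + 3 * (i + 1), frame + 1, segment
                first = i + 1


def pir_entry(start, end, frame, protein, strand="+", width=60):
    """Format one segment: entry name = start base; title = end, frame, strand."""
    lines = [f">P1; {start}", f"{end} {frame} {strand}"]
    lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines)
```

For example, print(pir_entry(*seg)) for each seg in open_frames(dna, 30) writes entries in the same general layout as the listing shown in the dialogue. The - strand could be handled by applying the same function to the reverse complement and mapping the positions back.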
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=40
|
||
|
Translate and write protein sequence to disk
|
||
|
? (y/n) (y) Translate selected regions
|
||
|
? (y/n) (y) Define segments using keyboard
|
||
|
Translate
|
||
|
? From (0-1023) (0) =1
|
||
|
? To (1-1023) (1023) =111
|
||
|
? (y/n) (y) + strand
|
||
|
Translate
|
||
|
? From (0-1023) (0) =
|
||
|
? Output file name=1.OUT
|
||
|
|
||
|
? Menu or option number=40
|
||
|
Translate and write protein sequence to disk
|
||
|
? (y/n) (y) Translate selected regions n
|
||
|
? Minimum open frame in amino acids (5-1000) (30) =
|
||
|
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 0,1,2,3 =3
|
||
|
? File name for translation=1.OUT
|
||
|
|
||
|
? Menu or option number=6
|
||
|
Page through text files
|
||
|
? Name of file to read=1.OUT
|
||
|
>P1; 25
|
||
|
135 1 +
|
||
|
GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
|
||
|
>P1; 238
|
||
|
486 1 +
|
||
|
IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
|
||
|
VDAGADFAAVRMLALFGAVVPH*
|
||
|
>P1; 556
|
||
|
795 1 +
|
||
|
|
||
|
SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
|
||
|
GAGRRDQWRGGGVDLANGR*
|
||
|
>P1; 796
|
||
|
987 1 +
|
||
|
|
||
|
FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
|
||
|
LRG*
|
||
|
>P1; 2
|
||
|
163 2 +
|
||
|
NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
|
||
|
>P1; 176
|
||
|
391 2 +
|
||
|
PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
|
||
|
YTFTFRLWEVT*
|
||
|
>P1; 476
|
||
|
628 2 +
|
||
|
SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
|
||
|
>P1; 629
|
||
|
958 2 +
|
||
|
MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
|
||
|
TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
|
||
|
>P1; 3
|
||
|
293 3 +
|
||
|
IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
|
||
|
DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
|
||
|
>P1; 411
|
||
|
521 3 +
|
||
|
CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
|
||
|
>P1; 1020
|
||
|
37 1 -
|
||
|
EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
|
||
|
MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
|
||
|
PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
|
||
|
VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
|
||
|
HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
|
||
|
SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
|
||
|
>P1; 373
|
||
|
-1 2 -
|
||
|
AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
|
||
|
AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
|
||
|
PNRI*
|
||
|
>P1; 517
|
||
|
407 2 -
|
||
|
QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
|
||
|
>P1; 649
|
||
|
518 2 -
|
||
|
QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
|
||
|
>P1; 853
|
||
|
650 2 -
|
||
|
HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
|
||
|
RCKTGCV*
|
||
|
>P1; 958
|
||
|
854 2 -
|
||
|
SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
|
||
|
>P1; 179
|
||
|
78 3 -
|
||
|
VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
|
||
|
>P1; 479
|
||
|
363 3 -
|
||
|
GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
|
||
|
>P1; 791
|
||
|
666 3 -
|
||
|
RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
|
||
|
>P1; 1022
|
||
|
819 3 -
|
||
|
SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
|
||
|
EWRLVERR*
|
||
|
End of file
|
||
|
@41. TX 5 @ Calculate and write codon table to disk
|
||
|
|
||
|
This routine calculates codon usage tables for sections of the
|
||
|
sequence and stores the resulting tables on disk. The sections to
|
||
|
translate can be defined from the keyboard or by supplying the name
|
||
|
of the appropriate EMBL library feature table.
|
||
|
|
||
|
If required users can add to an existing codon table stored as
|
||
|
a disk file. Choose between storing observed counts or having them
|
||
|
normalised so that the totals for each amino acid sum to 100. Select
|
||
|
between defining segments at the keyboard or using an EMBL feature
|
||
|
table. Define segments. Signal completion with a zero start. Supply
|
||
|
a file name. For each segment the program will display the counts;
|
||
|
at the end it will display the accumulated totals.
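The counting and the optional normalisation can be sketched in Python as follows. The function names are hypothetical and the sketch assumes segments are given as 1-based inclusive positions on the + strand; it also treats the three stop codons as one group when normalising, which is an assumption.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))  # codon -> one-letter amino acid


def count_codons(dna, start, end, table=None):
    """Add the codon counts of segment start..end to table and return it.

    Passing an existing table accumulates totals over several segments.
    """
    table = Counter() if table is None else table
    segment = dna[start - 1:end].upper()
    for i in range(0, len(segment) - 2, 3):
        codon = segment[i:i + 3]
        if codon in CODE:  # skip codons containing unknown bases
            table[codon] += 1
    return table


def normalise_per_amino_acid(table):
    """Rescale counts so the codons of each amino acid sum to 100."""
    totals = Counter()
    for codon, n in table.items():
        totals[CODE[codon]] += n
    return {c: (100.0 * table.get(c, 0) / totals[CODE[c]]
                if totals[CODE[c]] else 0.0)
            for c in CODONS}
```

Calling count_codons repeatedly with the same table mirrors the accumulated totals the program displays after the last segment.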
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=41
|
||
|
Calculate and write codon table to disk
|
||
|
? (y/n) (y) Start with empty table
|
||
|
? (y/n) (y) Show observed counts
|
||
|
? (y/n) (y) Define segments using keyboard
|
||
|
? Count from (0-1023) (0) =1
|
||
|
? Count to (1-1023) (1023) =111
|
||
|
? (y/n) (y) + strand
|
||
|
|
||
|
===========================================
|
||
|
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
|
||
|
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
|
||
|
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
|
||
|
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
|
||
|
===========================================
|
||
|
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
|
||
|
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
|
||
|
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
|
||
|
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
|
||
|
===========================================
|
||
|
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
|
||
|
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
|
||
|
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
|
||
|
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
|
||
|
===========================================
|
||
|
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
|
||
|
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
|
||
|
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
|
||
|
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
|
||
|
===========================================
|
||
|
? Count from (0-1023) (0) =
|
||
|
|
||
|
Codon totals over all genes
|
||
|
===========================================
|
||
|
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
|
||
|
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
|
||
|
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
|
||
|
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
|
||
|
===========================================
|
||
|
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
|
||
|
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
|
||
|
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
|
||
|
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
|
||
|
===========================================
|
||
|
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
|
||
|
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
|
||
|
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
|
||
|
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
|
||
|
===========================================
|
||
|
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
|
||
|
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
|
||
|
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
|
||
|
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
|
||
|
===========================================
|
||
|
? (y/n) (y) Save table in a file n
|
||
|
@42. TX 6 @ Codon usage method
|
||
|
|
||
|
Used to find protein coding regions. For each window length of
|
||
|
the sequence the routine measures the closeness to an expected codon
|
||
|
usage. Results are plotted for each of the three reading frames.
|
||
|
Stop and start codons are also marked on the plots. Has the highest
|
||
|
resolution of all such methods, but makes the strongest assumption,
|
||
|
i.e. that the codon usage is known. The latest version is described
|
||
|
in Methods in Enzymology 183, 193-211.
|
||
|
|
||
|
Choose whether to use an internal standard (i.e. part of the
|
||
|
current sequence known to code for a protein). If so define its end
|
||
|
points, and those of any others. Otherwise supply the name of a disk
|
||
|
file containing a table of codon usage. Tables are listed. Choose
|
||
|
between using the observed counts, or two types of normalisation:
|
||
|
normalised to give an average amino acid composition; normalised to
|
||
|
no amino acid bias. The first normalisation is often
|
||
|
sensible, but the second removes valuable information and is only
|
||
|
made available for special circumstances. The final table will be
|
||
|
displayed, followed by the expected scores for window lengths 21, 31
|
||
|
and 41 codons. The scores for each of the three reading frames are
|
||
|
shown (they are logarithmic values) to help users choose a window
|
||
|
length for the analysis. Define a window length and plot interval.
|
||
|
Plotting will start.
|
||
|
|
||
|
The method was first described in Staden and McLachlan Nucl.
|
||
|
Acid Res. 10 141-156 (1982) and the following is a summary of the
|
||
|
initial ideas. The method makes the following main assumptions: the
|
||
|
codon preferences of all the genes in the sequence we are examining
|
||
|
are similar to those of the standard; the sequence is coding
|
||
|
throughout its whole length in only one reading frame; in the coding
|
||
|
frame the frequency of codon abc has a definite value Fabc.
|
||
|
If we select a sequence a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
|
||
|
then the probability of selecting it in each of the three frames is:
|
||
|
frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
|
||
|
frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
|
||
|
frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
|
||
|
The probability that selection of a particular sequence was "caused"
|
||
|
by it being a coding sequence is:
|
||
|
P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
|
||
|
The program calculates these values for the given window length but
|
||
|
plots Log(P/(1-P)) for each of the three frames. At each point along
|
||
|
the sequence that the program has a point to plot it finds which of
|
||
|
the three values is highest and places a single point at the 50%
|
||
|
level for the corresponding frame. These single points will join to
|
||
|
form a solid line if one frame is consistently the highest scoring.
|
||
|
In addition stop codons are shown as short vertical lines that
|
||
|
bisect the 50% level of probability. When looking for coding regions
|
||
|
the user should look for solid horizontal lines at the 50% level
|
||
|
that are not interrupted by these short vertical lines.
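The calculation for a single window can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the function names are hypothetical, natural logarithms are used (the text does not give the base), and every codon frequency is assumed to be greater than zero (how the program treats zero counts is not described here). The products are formed as sums of logarithms to avoid underflow on long windows.

```python
import math


def frame_probabilities(window, freq):
    """Return [P1, P2, P3] for a window of 3n+2 bases.

    freq maps each codon (upper case) to its frequency in the standard.
    Frame 1 uses codons a1b1c1..anbncn, frame 2 starts one base later,
    frame 3 two bases later, each with n codons.
    """
    n = (len(window) - 2) // 3
    log_p = []
    for frame in range(3):
        total = 0.0
        for k in range(n):
            codon = window[frame + 3 * k:frame + 3 * k + 3]
            total += math.log(freq[codon])
        log_p.append(total)
    biggest = max(log_p)
    p = [math.exp(v - biggest) for v in log_p]  # rescaled p1, p2, p3
    return [v / sum(p) for v in p]


def log_odds(prob):
    """Log(P/(1-P)), the value plotted for each frame."""
    if prob <= 0.0:
        return -math.inf
    if prob >= 1.0:
        return math.inf
    return math.log(prob / (1.0 - prob))
```

The frame marked at the 50% level is then the one with the largest of the three values, for example max(range(3), key=lambda i: P[i]).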
|
||
|
|
||
|
Changes. Two normalisations are offered: 1) to remove all
|
||
|
amino acid compositional components from the tables, hence leaving
|
||
|
only the codon preference components. In general this is not
|
||
|
recommended as the amino acid component alone is often sufficient to
|
||
|
choose correctly between frames, but may be useful in special
|
||
|
circumstances. 2) to change the amino acid composition components to
|
||
|
give an average amino acid composition rather than the one contained
|
||
|
in the standard (this leaves the codon preference components
|
||
|
unchanged). In general this should be useful as the average amino
|
||
|
acid composition is likely to be closer to the composition of the
|
||
|
genes being hunted, than is that of the standard table of codon
|
||
|
preferences. The average composition is that recently published by
|
||
|
Argos, not the Dayhoff one that we have used before.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=42
|
||
|
Staden and McLachlan codon usage method
|
||
|
Codon tables for standards may be read from disk
|
||
|
or calculated from parts of the current sequence
|
||
|
? (y/n) (y) Define internal standard
|
||
|
Define standard
|
||
|
? start (0-1023) (0) =1
|
||
|
? end (2-1023) (1023) =1000
|
||
|
===========================================
|
||
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
||
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
||
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
||
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
||
|
===========================================
|
||
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
||
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
||
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
||
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
||
|
===========================================
|
||
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
||
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
||
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
||
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
||
|
===========================================
|
||
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
||
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
||
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
||
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
||
|
===========================================
|
||
|
Define standard
|
||
|
? start (0-1023) (0) =
|
||
|
Total codons in standard= 333.
|
||
|
X 1 Use observed frequencies
|
||
|
2 Normalize to average amino acid composition
|
||
|
3 Normalize to no amino acid bias
|
||
|
? 0,1,2,3 =2
|
||
|
===========================================
|
||
|
F TTT 19. S TCT 2. Y TAT 10. C TGT 3.
|
||
|
F TTC 6. S TCC 22. Y TAC 10. C TGC 8.
|
||
|
L TTA 2. S TCA 0. * TAA 0. * TGA 0.
|
||
|
L TTG 7. S TCG 2. * TAG 0. W TGG 8.
|
||
|
===========================================
|
||
|
L CTT 16. P CCT 16. H CAT 4. R CGT 10.
|
||
|
L CTC 12. P CCC 0. H CAC 10. R CGC 10.
|
||
|
L CTA 0. P CCA 0. Q CAA 8. R CGA 7.
|
||
|
L CTG 21. P CCG 16. Q CAG 18. R CGG 6.
|
||
|
===========================================
|
||
|
I ATT 19. T ACT 13. N AAT 16. S AGT 2.
|
||
|
I ATC 11. T ACC 17. N AAC 12. S AGC 15.
|
||
|
I ATA 3. T ACA 3. K AAA 22. R AGA 1.
|
||
|
M ATG 15. T ACG 3. K AAG 15. R AGG 1.
|
||
|
===========================================
|
||
|
V GTT 15. A GCT 21. D GAT 14. G GGT 10.
|
||
|
V GTC 7. A GCC 16. D GAC 20. G GGC 13.
|
||
|
V GTA 8. A GCA 8. E GAA 26. G GGA 14.
|
||
|
V GTG 11. A GCG 8. E GAG 13. G GGG 9.
|
||
|
===========================================
|
||
|
Span length 21 expected mean values: 4.8 -5.7 -4.8
|
||
|
Span length 31 expected mean values: 7.1 -8.4 -7.2
|
||
|
Span length 41 expected mean values: 9.5 -11.1 -9.5
|
||
|
? odd span length (11-101) (25) =41
|
||
|
? plot interval (1-11) (5) =
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
@43. TX 6 @ Positional base preference method.
|
||
|
|
||
|
Used to find protein coding regions. For each window length of
|
||
|
the sequence the routine measures the closeness to an expected
|
||
|
pattern of base frequencies. Results are plotted for each of the
|
||
|
three reading frames. Stop and start codons are also marked on the
|
||
|
plots. The method is particularly useful for showing which reading
|
||
|
frame is the most likely to be coding. The latest version is
|
||
|
described in a forthcoming issue of Methods in Enzymology, but the
|
||
|
original ideas were given in Staden, R. Nucl. Acid Res. 12 551-567
|
||
|
(1984).
|
||
|
|
||
|
If dialogue is requested the following inputs are needed,
|
||
|
otherwise the standard analysis is performed. Choose between a
|
||
|
"global" standard, or a selected one. If the global standard is
|
||
|
selected the expected scores are displayed and the user asked to
|
||
|
define a span length and a plot interval. Then users choose between
|
||
|
plotting relative or absolute scores, and can reset the scaling
|
||
|
values employed for plotting. If the global standard is not
|
||
|
selected users must define a region of the sequence to use as a
|
||
|
standard, or they can read in a codon table from which the program
|
||
|
will calculate one. Then they can either use the values observed in
|
||
|
this standard, or they can combine its values for the third
|
||
|
positions in codons, with those from the global standard. Next they
|
||
|
can give different weightings to each of the three positions in
|
||
|
codons.
|
||
|
|
||
|
In its original form the method took advantage of the uneven
|
||
|
use of amino acids by proteins and the structure of the genetic code
|
||
|
table and assumed that there is a typical ("global") amino acid
|
||
|
composition and no codon preference. The typical amino acid
|
||
|
composition is the average composition found by Argos (see below).
|
||
|
This composition and no codon preference determines the frequency of
|
||
|
each of the four bases in each of the three codon positions. This
|
||
|
3x4 frequency table shows unequal use of the bases and in particular
|
||
|
a marked use of G in position 1 and of A in position 2 (at the
|
||
|
expense of G). The routine slides a window along the sequence and
|
||
|
calculates a score for each of the three reading frames at each
|
||
|
window position. It assumes the sequence is coding throughout its
|
||
|
whole length and calculates the probability that it is coding in
|
||
|
each of the three frames. When tested against all the E. coli
|
||
|
sequences in the EMBL sequence library it correctly identified the
|
||
|
coding frame for 91% of window positions. (The E. coli sequences
|
||
|
were chosen only for technical reasons: I have no reason to think
|
||
|
the method would work less well on other organisms with roughly even
|
||
|
base composition.) The routine can plot either absolute or relative
|
||
|
values: i.e. absolute values are the values found by summing the
|
||
|
scores for each frame (say p1, p2 and p3), and the relative values
|
||
|
are then p1/(p1+p2+p3), p2/(p1+p2+p3) and p3/(p1+p2+p3).
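A Python sketch of the scoring follows. It rests on an assumption: the score of a codon is taken to be the sum, over its three positions, of the table frequency of the base found there, multiplied by the positional weight. Under this reading the expected in-frame score per codon for the second table in the dialogue examples comes out at roughly 0.786, close to the 0.785 printed there, but the exact formula used by the program is not stated in this text. All names are hypothetical, and the window must be upper case and contain only T, C, A and G.

```python
def frame_scores(window, table, weights=(1.0, 1.0, 1.0)):
    """Absolute scores p1, p2, p3 of a window of 3n+2 bases.

    table is a list of three dicts (codon positions 1 to 3), each mapping
    a base to its expected frequency at that position.
    """
    n = (len(window) - 2) // 3
    scores = []
    for frame in range(3):
        total = 0.0
        for k in range(n):
            codon = window[frame + 3 * k:frame + 3 * k + 3]
            for pos, base in enumerate(codon):
                total += weights[pos] * table[pos][base]
        scores.append(total)
    return scores


def relative_scores(scores):
    """Relative values p1/(p1+p2+p3) and so on."""
    total = sum(scores)
    return [s / total for s in scores]


def expected_in_frame(table):
    """Expected score per codon when the table itself describes the sequence."""
    return sum(f * f for row in table for f in row.values())
```

Setting weights to (1.0, 0.0, 0.0), then (0.0, 1.0, 0.0), then (0.0, 0.0, 1.0) shows the separate contribution of each codon position.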
|
||
|
|
||
|
At each point along the sequence that the program has a point
|
||
|
to plot it finds which of the three values is highest and places a
|
||
|
single point at the 50% level for the corresponding frame. These
|
||
|
single points will join to form a solid line if one frame is
|
||
|
consistently the highest scoring. In addition stop codons are shown
|
||
|
as short vertical lines that bisect the 50% level of probability.
|
||
|
When looking for coding regions the user should look for solid
|
||
|
horizontal lines at the 50% level that are not interrupted by these
|
||
|
short vertical lines. The absolute mean values expected on the
|
||
|
complement of the coding strand (and in the same frame) are 5% lower
|
||
|
than those on the coding strand but the relative values are the same
|
||
|
on both strands. Although the relative values give smoother plots
|
||
|
and tend to emphasize the coding frame, they therefore cannot be
|
||
|
used to decide which strand is coding. The absolute values plot
|
||
|
should be used for this purpose but bearing in mind the fact that the
|
||
|
differences between strands are quite small.
|
||
|
|
||
|
The method has been improved in two overall ways: first it now
|
||
|
allows users to define their own typical amino acid composition by
|
||
|
selecting a standard sequence from within the sequence they are
|
||
|
analysing or from a codon table; secondly it allows the inclusion of
|
||
|
third position preferences. Again these third position preferences
|
||
|
are defined by the use of an internal standard sequence. Not only
|
||
|
can users define their own standards but they can also give weights
|
||
|
to each of the three positions in codons. This allows different
|
||
|
emphasis to be used for each of the three positions. As an example
|
||
|
of its use, by giving, in turn, weights of 1.0, 0.0, 0.0, and 0.0,
|
||
|
1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate
|
||
|
contribution made by each of the three positions. It is also
|
||
|
possible to use the third position preferences with the values for
|
||
|
the first two positions taken from the "global" amino acid
|
||
|
composition. In all cases users may choose to plot absolute or
|
||
|
relative values. The expected scores are displayed before each
|
||
|
analysis and scales are drawn on the plots. At present this method
|
||
|
does not give probabilities of coding; it has only been tested for
|
||
|
its ability to choose the correct reading frame (see above). It
|
||
|
could be used to give probabilities of coding if it was applied to all
|
||
|
known coding and non-coding sequences in the way that the uneven
|
||
|
positional base frequencies method was. It is designed to be used in
|
||
|
conjunction with this method. Note that the average amino acid
|
||
|
composition used to derive the base frequencies was changed on 17-
|
||
|
11-1988, to be the new average given by McCaldon and Argos in
|
||
|
Proteins 4 99-122 (1988). A further change is to allow users to
|
||
|
select their own scales for producing the plots. It can be helpful
|
||
|
if they want to emphasise or diminish certain features.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=D43
|
||
|
Positional base preferences method to find protein genes
|
||
|
Select standard source
|
||
|
X 1 Use global standard
|
||
|
2 Use internal standard
|
||
|
3 Use codon usage table
|
||
|
? Selection (1-3) (1) =2
|
||
|
Define region for standard
|
||
|
? start (0-8134) (0) =3171
|
||
|
? end (3172-8134) (8134) =4700
|
||
|
Select normalisation
|
||
|
X 1 Use observed frequencies
|
||
|
2 Combine with global standard
|
||
|
? Selection (1-2) (1) =1
|
||
|
T C A G Range
|
||
|
1 0.125 0.249 0.230 0.397 0.272
|
||
|
2 0.298 0.245 0.292 0.165 0.132
|
||
|
3 0.288 0.313 0.169 0.230 0.144
|
||
|
? (y/n) (y) Use 1.0 for positional weights
|
||
|
Give weights between 0.0 and 1.0
|
||
|
to each of the 3 codon positions
|
||
|
? Position 1 (0.00-1.00) (1.00) =
|
||
|
? Position 2 (0.00-1.00) (1.00) =
|
||
|
? Position 3 (0.00-1.00) (1.00) =
|
||
|
Expected scores per codon in each frame
|
||
|
0.136 0.122 0.123
|
||
|
? odd span length (31-101) (67) =
|
||
|
? plot interval (1-11) (5) =
|
||
|
? (y/n) (y) Plot relative scores
|
||
|
Scaling values:
|
||
|
Minimum maximum range
|
||
|
0.3121 0.3656 0.0382
|
||
|
? (y/n) (y) Leave scaling values unchanged
|
||
|
|
||
|
Graphics not shown
|
||
|
|
||
|
? Menu or option number=D43
|
||
|
Positional base preferences method to find protein genes
|
||
|
Select standard source
|
||
|
X 1 Use global standard
|
||
|
2 Use internal standard
|
||
|
3 Use codon usage table
|
||
|
? Selection (1-3) (1) =3
|
||
|
? File name of standard=atpase.cods
|
||
|
===========================================
|
||
|
F TTT 21. S TCT 33. Y TAT 15. C TGT 5.
|
||
|
F TTC 55. S TCC 40. Y TAC 40. C TGC 4.
|
||
|
L TTA 8. S TCA 7. * TAA 8. * TGA 0.
|
||
|
L TTG 19. S TCG 12. * TAG 1. W TGG 17.
|
||
|
===========================================
|
||
|
L CTT 22. P CCT 17. H CAT 6. R CGT 73.
|
||
|
L CTC 21. P CCC 4. H CAC 30. R CGC 23.
|
||
|
L CTA 1. P CCA 10. Q CAA 19. R CGA 5.
|
||
|
L CTG 168. P CCG 48. Q CAG 80. R CGG 3.
|
||
|
===========================================
|
||
|
I ATT 47. T ACT 14. N AAT 17. S AGT 8.
|
||
|
I ATC 98. T ACC 54. N AAC 52. S AGC 26.
|
||
|
I ATA 6. T ACA 7. K AAA 85. R AGA 0.
|
||
|
M ATG 75. T ACG 13. K AAG 28. R AGG 0.
|
||
|
===========================================
|
||
|
V GTT 67. A GCT 56. D GAT 41. G GGT 90.
|
||
|
V GTC 29. A GCC 53. D GAC 66. G GGC 66.
|
||
|
V GTA 49. A GCA 59. E GAA 101. G GGA 5.
|
||
|
V GTG 57. A GCG 64. E GAG 41. G GGG 8.
|
||
|
===========================================
|
||
|
Select normalisation
|
||
|
X 1 Use observed frequencies
|
||
|
2 Combine with global standard
|
||
|
? Selection (1-2) (1) =2
|
||
|
T C A G Range
|
||
|
1 0.177 0.211 0.277 0.336 0.159
|
||
|
2 0.271 0.238 0.310 0.182 0.128
|
||
|
3 0.242 0.301 0.168 0.289 0.132
|
||
|
? (y/n) (y) Use 1.0 for positional weights
|
||
|
Expected scores per codon in each frame
|
||
|
0.785 0.736 0.736
|
||
|
? odd span length (31-101) (67) =
|
||
|
? plot interval (1-11) (5) =
|
||
|
? (y/n) (y) Plot relative scores
|
||
|
Scaling values:
|
||
|
Minimum maximum range
|
||
|
0.3219 0.3519 0.0214
|
||
|
? (y/n) (y) Leave scaling values unchanged
|
||
|
|
||
|
Graphics not shown
|
||
|
@44. TX 6 @ Uneven positional base frequencies.
|
||
|
|
||
|
Used to find regions of a sequence that might be coding for a
|
||
|
protein. The method looks for sections of the sequence in which the
|
||
|
frequency at which each of the four bases occupies the three
|
||
|
positions in codons is nonrandom. The level of nonrandomness is
|
||
|
plotted on a scale that shows the probability that the sequence is
|
||
|
coding. At each position along a sequence the calculation gives the
|
||
|
same value for all six possible reading frames, so only one value is
|
||
|
plotted.
|
||
|
|
||
|
Define the window length and plot interval.
|
||
|
|
||
|
The results are plotted in a box divided by a horizontal line
|
||
|
marked "76%". 76% of coding regions achieve values above this line
|
||
|
and 76% of noncoding regions achieve scores below the line.
|
||
|
|
||
|
This method, first described in Staden R. Nucl. Acid Res. 12
|
||
|
551-567 1984, looks for uneven positional usage of bases in codons.
|
||
|
It looks through the sequence in one fixed phase and counts the
|
||
|
number of times each base appears in each of the three codon
|
||
|
positions: for each window position it counts A1,A2,A3 and C1,C2,C3
|
||
|
and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and
|
||
|
similarly CMEAN, GMEAN and TMEAN; it then calculates ADIF=abs(A1-
|
||
|
AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly CDIF, GDIF and TDIF
|
||
|
to measure the differences between an even base usage for all
|
||
|
positions in the codons and the observed usage. The routine then
|
||
|
calculates the sum ADIF+CDIF+GDIF+TDIF and plots this value on the
|
||
|
following scale: the base level is such that no known window in a
|
||
|
coding region has a lower value, whereas 14% of windows in noncoding
|
||
|
sequences score below it. The top of the scale is not achieved by
|
||
|
any known noncoding region, but is reached by 16% of known coding
|
||
|
regions. The bar drawn across the plot corresponds to a level that
|
||
|
is exceeded by 76% of windows in known coding regions but is reached
|
||
|
by only 24% of windows in known noncoding regions, i.e. 76% of coding
|
||
|
windows score above and 76% of noncoding windows score below. This
|
||
|
is similar to Fickett's method but without the probabilities and
|
||
|
weightings from the Los Alamos sequence library: it is therefore
|
||
|
unbiased but may well give very similar results.
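The raw score for one window can be sketched in Python as follows. The function names are hypothetical, and the sketch returns the plain sum ADIF+CDIF+GDIF+TDIF in units of base counts; the calibrated plotting scale (base level, 76% bar and top of scale) is not reproduced because its numerical values are not given here.

```python
def unevenness(window):
    """Return ADIF+CDIF+GDIF+TDIF for one window read in a fixed phase."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(window.upper()):
        if base in counts:  # ignore unknown bases
            counts[base][i % 3] += 1
    total = 0.0
    for base in "ACGT":
        mean = sum(counts[base]) / 3  # e.g. AMEAN=(A1+A2+A3)/3
        total += sum(abs(c - mean) for c in counts[base])  # e.g. ADIF
    return total


def scan(sequence, window=60, step=3):
    """Return (1-based start, score) for each window position."""
    return [(i + 1, unevenness(sequence[i:i + window]))
            for i in range(0, len(sequence) - window + 1, step)]
```

For a window whose length is a multiple of 3, shifting the phase by one base only permutes the three position counts of each base, so the sum is unchanged; this is why a single value serves for all reading frames.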
|
||
|
@45. TX 6 @ Codon improbability on base composition
|
||
|
|
||
|
Used to find regions of a sequence that might code for a
|
||
|
protein.
|
||
|
|
||
|
If dialogue is requested define a window length and plot
|
||
|
interval.
|
||
|
|
||
|
The idea of the method is, that of all sequence features that
|
||
|
we know, it is only coding regions that will give rise to codon
|
||
|
biases well above those expected from the base composition. If a
|
||
|
region of sequence shows sufficiently strong codon bias then we
|
||
|
conclude that it is coding for a protein. Using the multinomial
|
||
|
distribution we have derived a function to measure the improbability
|
||
|
of observing a set of codons from a sequence of the given
|
||
|
composition. Using the Poisson distribution we have worked out the
|
||
|
distribution of the improbability. The program plots the observed
|
||
|
improbability minus the expected improbability (the mean as
|
||
|
calculated from the Poisson distribution). The plots are presented
|
||
|
against a scale of units of standard deviation as measured from the
|
||
|
Poisson distribution. As with the other Staden and McLachlan method
|
||
|
the program puts an extra point at a fixed level for the highest of
|
||
|
the three probabilities; for this function this point is placed at
|
||
|
six standard deviations above the mean expected level. The top of
|
||
|
each plot corresponds to 12 standard deviations above the expected
|
||
|
level and the bottom corresponds to the expected value.
|
||
|
|
||
|
Analysis of the application of the method to the EMBL sequence
|
||
|
library indicates that the method does work for most sequences and
|
||
|
that the levels of improbability roughly correlate with levels of
|
||
|
expression. Coding regions will show high peaks in all three frames
|
||
|
making interpretation more difficult than for some of the other
|
||
|
methods.
|
||
|
@46. TX 6 @ Codon improbability on amino acid composition
|
||
|
|
||
|
Used to find regions of a sequence that might code for a
|
||
|
protein.
|
||
|
|
||
|
If dialogue is requested define a window length and a plot
|
||
|
interval.
|
||
|
|
||
|
The idea of the method is, that of all sequence features that
|
||
|
we know, it is only coding regions that will give rise to codon
|
||
|
biases such that, for each amino acid, some codons are used far more
|
||
|
frequently than others. The method is independent of what the bias
|
||
|
actually is, requiring only that it is present. If a region of
|
||
|
sequence shows sufficiently strong codon bias then we conclude that
|
||
|
it is coding for a protein. Using the multinomial distribution we
|
||
|
have derived a function to measure the improbability of observing a
|
||
|
set of codons from a sequence of the given composition. Using the
|
||
|
Poisson distribution we have worked out the distribution of the
|
||
|
improbability. The program plots the observed improbability minus
|
||
|
the expected improbability (the mean as calculated from the Poisson
|
||
|
distribution). The plots are presented against a scale of units of
|
||
|
standard deviation as measured from the Poisson distribution. As
|
||
|
with the other Staden and McLachlan method the program puts an extra
|
||
|
point at a fixed level for the highest of the three probabilities;
|
||
|
for this function this point is placed at six standard deviations
|
||
|
above the mean expected level. The top of each plot corresponds to
|
||
|
12 standard deviations above the expected level and the bottom
|
||
|
corresponds to the expected value.
|
||
|
@47. TX 6 @ Shepherd RNY preference method
|
||
|
|
||
|
Used to find regions of a sequence that might code for a
|
||
|
protein. Based on the method of Shepherd (PNAS 78 1596-1600, 1981).
|
||
|
|
||
|
If dialogue is requested define a window length and plot
|
||
|
interval.
|
||
|
|
||
|
Shepherd has found that many genes have a preference for the
|
||
|
use of codons of the form RNY where R=purine, Y=pyrimidine and N=any
|
||
|
base. He has attributed this to remnants of a primitive
|
||
|
genetic code. The calculation is similar to that for the Staden and
|
||
|
McLachlan method, the p1's being simply the number of RNY codons
|
||
|
found in frame 1 etc. and the P's being P1=p1/(p1+p2+p3) etc.
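
The RNY calculation described above can be sketched as follows. This is a minimal illustration, not the program's own code: the function name is hypothetical, and it assumes that U is treated as a pyrimidine and that a window with no RNY codons at all gives three zeros.

```python
def rny_frame_preferences(window):
    """Return (P1, P2, P3): the share of RNY codons found in each frame."""
    purines = set("AG")
    pyrimidines = set("CTU")
    seq = window.upper()
    counts = []
    for frame in range(3):
        n = 0
        # Step through complete codons starting at this frame offset.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i] in purines and seq[i + 2] in pyrimidines:
                n += 1
        counts.append(n)
    total = sum(counts)
    if total == 0:
        # Assumption: no RNY codon in any frame gives no preference at all.
        return (0.0, 0.0, 0.0)
    return tuple(c / total for c in counts)
```

For a window made only of GCT repeats every RNY codon lies in frame 1, so the function returns (1.0, 0.0, 0.0).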
|
||
|
@48. TX 6 @ Ficketts method
|
||
|
|
||
|
Used to find regions of a sequence that might code for a
|
||
|
protein. Based on the method of Fickett (Nucl. Acid Res. 10, 1982),
|
||
|
but plots values for fixed window lengths rather than over the whole
|
||
|
of open reading frames.
|
||
|
|
||
|
If dialogue is requested define a window length and plot
|
||
|
interval. The results are plotted in a box divided into three
|
||
|
horizontal strips.
|
||
|
|
||
|
Sections of the sequence with values plotted in the top strip
|
||
|
of the box are adjudged to be coding, those in the middle strip "no
|
||
|
decision", and those in the bottom "not coding".
|
||
|
|
||
|
The program performs the following calculations: let A1 = the
|
||
|
number of occurrences of base A in position 1 of codons, A2 for
|
||
|
position 2 etc. Similarly for bases C,G and T. For each window
|
||
|
position calculate Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly for
|
||
|
C,G and T to give 4 positional values. Also count the base
|
||
|
composition for the window to give Acomp, Ccomp etc. Fickett tested
|
||
|
each of these 8 parameters singly as to their ability to distinguish
|
||
|
coding from noncoding regions and arrived at probabilities of coding
|
||
|
for the range of values each can take = Pcod. He also measured their
|
||
|
relative abilities and gave weightings to each of the 8 parameters
|
||
|
= Pw. To calculate the "TESTCODE" for a window we first look up the
|
||
|
Pcod for each of the calculated compositional and positional values
|
||
|
and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE is plotted
|
||
|
relative to three levels of decision: the top division="coding", the
|
||
|
middle="no opinion" and the bottom division="non coding".
|
||
|
@49. TX 6 @ tRNA gene search.
|
||
|
|
||
|
Used to find segments of a sequence that might code for tRNAs.
|
||
|
Looks for potential cloverleaf forming structures and then for the
|
||
|
presence of expected conserved bases. Presents results graphically
|
||
|
or draws out the cloverleafs.
|
||
|
|
||
|
If dialogue is requested a large number of parameters need to
|
||
|
be given values, including some loop lengths, scores for each of the
|
||
|
four stems, and scores for the conserved bases.
|
||
|
|
||
|
The program was first described in Staden Nucl. Acid Res
|
||
|
817-825 (1980). The tRNA's that have been sequenced so far have
|
||
|
two characteristics that can be used to locate their genes within
|
||
|
long DNA sequences. Firstly they have a common secondary
|
||
|
structure - the cloverleaf - and secondly, particular bases
|
||
|
almost always appear at certain positions in the cloverleaf. The
|
||
|
cloverleaf is composed of four base-paired stems and four loops.
|
||
|
Three of the stems are of fixed length but the fourth, the
|
||
|
dhu stem which usually has four base pairs, sometimes has only
|
||
|
three. All of the loops can vary in size. The following
|
||
|
relationships between the stems in the cloverleaf are assumed in the
|
||
|
program: (a) there are no bases between one end of the aminoacyl
|
||
|
stem and the adjoining tuc stem; (b) there are two bases between
|
||
|
the aminoacyl stem and the dhu stem; (c) there is one base between
|
||
|
the dhu stem and the anticodon stem; (d) there are at least three
|
||
|
bases between the anticodon stem and the tuc stem. The program
|
||
|
looks first for cloverleaf structure and then, if required, for
|
||
|
conserved bases. The sizes of the loops, the number of basepairs in
|
||
|
the stems and the required conserved bases may all be specified
|
||
|
by the user. The process of looking for the presence of conserved
|
||
|
bases can reduce the number of potential structures found
|
||
|
considerably. The user may also specify that an intron may be
|
||
|
present in the anticodon loop.
|
||
|
|
||
|
The user may define a minimum number of base pairs for each
|
||
|
stem using the scoring system G-C, A-T=2 and G-T=1 and scores for
|
||
|
each of the conserved bases. Recommended values for the stem scores
|
||
|
are given by the prompts and the percentage conservation of the
|
||
|
conserved bases as found in the Nucl. Acid Res 1979 paper of Gauss,
|
||
|
Gruter and Sprinzl are also given, but the user must decide which
|
||
|
bases are most likely to be conserved for the sequence being
|
||
|
examined. The output shows the position of the possible gene in the
|
||
|
sequence by a vertical line the height of which shows the number of
|
||
|
basepairs made in the stems. The cloverleaf structure is also drawn
|
||
|
but will scroll up off the screen. Output of the cloverleafs will
|
||
|
look like:
|
||
|
|
||
|
6942
|
||
|
A
|
||
|
A-U
|
||
|
A-U
|
||
|
G-C
|
||
|
A-U
|
||
|
U-A
|
||
|
A-U
|
||
|
U-A AAU
|
||
|
U UAUCU
|
||
|
AA A !!!!!
|
||
|
AAUG AUAGA A
|
||
|
U !!!! U UCA
|
||
|
C UUAC U
|
||
|
AA A
|
||
|
U-AA A
|
||
|
A-U
|
||
|
A-U
|
||
|
C-G
|
||
|
U-A
|
||
|
U A
|
||
|
U A
|
||
|
GUC
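
The stem scoring system described above (G-C and A-T score 2, G-T scores 1) can be sketched as follows. The function name and the convention of passing both arms written 5' to 3' are assumptions made for this illustration.

```python
PAIR_SCORES = {
    ("G", "C"): 2, ("C", "G"): 2,
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(five_prime_arm, three_prime_arm):
    """Score a stem; both arms are written 5' to 3' and have equal length."""
    a = five_prime_arm.upper().replace("U", "T")
    # Reverse the second arm so that paired bases line up index by index.
    b = three_prime_arm.upper().replace("U", "T")[::-1]
    if len(a) != len(b):
        raise ValueError("stem arms must have the same length")
    return sum(PAIR_SCORES.get(pair, 0) for pair in zip(a, b))
```

A seven base pair aminoacyl stem made of six Watson-Crick pairs and one G-T pair scores 6*2+1=13, which is within the 0-14 range offered by the "Aminoacyl stem score" prompt.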
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=D49
|
||
|
tRNA search
|
||
|
? Maximum trna length (70-130) (92) =
|
||
|
? Aminoacyl stem score (0-14) (11) =
|
||
|
? Tu stem score (0-10) (8) =
|
||
|
? Anticodon stem score (0-10) (8) =
|
||
|
? D stem score (0-8) (3) =
|
||
|
? Minimum base pairing total (30-32) (32) =
|
||
|
? Minimum intron length (0-30) (0) =
|
||
|
? Minimum length for TU loop (4-12) (6) =
|
||
|
? Maximum length for TU loop (6-12) (9) =
|
||
|
? (y/n) (y) Skip search for conserved bases n
|
||
|
Give a score for each base, then a minimum total at the end
|
||
|
? Base 8, T is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 10, G is 95% conserved. Score (0-100) (0) =
|
||
|
? Base 11, Y is 96% conserved. Score (0-100) (0) =
|
||
|
? Base 14, A is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 15, R is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 21, A is 97% conserved. Score (0-100) (0) =
|
||
|
? Base 32, Y is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 33, T is 98% conserved. Score (0-100) (0) =
|
||
|
? Base 37, A is 91% conserved. Score (0-100) (0) =
|
||
|
? Base 48, Y is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 53, G is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 54, T is 95% conserved. Score (0-100) (0) =
|
||
|
? Base 55, T is 97% conserved. Score (0-100) (0) =
|
||
|
? Base 56, C is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 57, R is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 58, A is 100% conserved. Score (0-100) (0) =
|
||
|
? Base 60, Y is 92% conserved. Score (0-100) (0) =
|
||
|
? Base 61, C is 100% conserved. Score (0-100) (0) =
|
||
|
? Minimum total conserved base score (0-0) (0) =
|
||
|
? (y/n) (y) Plot results n
|
||
|
|
||
|
Searching
|
||
|
|
||
|
306
|
||
|
C
|
||
|
C-G
|
||
|
C-G
|
||
|
G-C
|
||
|
T-A
|
||
|
C-G
|
||
|
A-T
|
||
|
T+G AT
|
||
|
A ATACA
|
||
|
TTC T !!!! G
|
||
|
CTGT TATGG G
|
||
|
G ! ! T GA
|
||
|
C TAAA C
|
||
|
GCG C G
|
||
|
T+GA C
|
||
|
C-G C T
|
||
|
T+G A T
|
||
|
T-A G T
|
||
|
T-A G A
|
||
|
G G G C
|
||
|
A A G A
|
||
|
AGC T C
|
||
|
A T
|
||
|
C T
|
||
|
A
|
||
|
C T
|
||
|
|
||
|
|
||
|
@50. TX 7 @ Plot start codons
|
||
|
|
||
|
This function plots the positions of all start codons for each
|
||
|
of the three reading frames.
|
||
|
@51. TX 7 @ Plot stop codons
|
||
|
|
||
|
This function plots the positions of all stop codons for each
|
||
|
of the three reading frames.
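
A minimal sketch of how stop codon positions can be collected for each frame is given below. It assumes the standard stop codons TAA, TAG and TGA and numbers frames by the position of the first base of the codon; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def stop_codon_positions(sequence):
    """Return {frame: [1-based positions of the first base of each stop]}."""
    seq = sequence.upper().replace("U", "T")
    frames = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            frames[i % 3 + 1].append(i + 1)
    return frames
```

For ATGTAACTGA this gives a stop at position 4 in frame 1 (TAA) and at position 8 in frame 2 (TGA).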
|
||
|
@52. TX 7 @ Plot stop codons on the complementary strand
|
||
|
|
||
|
This function plots the positions of all stop codons for each
|
||
|
of the three reading frames on the complementary strand.
|
||
|
@53. TX 7 @ Plot stop codons on both strands
|
||
|
|
||
|
This function plots the positions of all stop codons for each
|
||
|
of the three reading frames on both strands.
|
||
|
@54. TX 5 @ Search for longest open reading frames
|
||
|
|
||
|
This function will report the positions of the ends of all
|
||
|
sections of sequence that contain no stop codons. All six reading
|
||
|
frames are examined. Results are presented in the form of an EMBL
|
||
|
feature table. Hence if the results are stored in a file by use of
|
||
|
"direct output to disk", the file can be used to translate the open
|
||
|
reading frames in a sequence. Note that in order for the file to be
|
||
|
used as a feature table it must include either EMBL or GenBank
|
||
|
headers, and a suitable "tail". The simplest header is the word
|
||
|
FEATURES starting in column 1 of the first line of the file. The
|
||
|
simplest tail is 2 empty lines at the end of the file. These lines
|
||
|
are not included when nip writes out results in feature table
|
||
|
format.
|
||
|
|
||
|
Define the minimum length of open reading frame to report (in
|
||
|
amino acids). Choose to search either or both strands. The program
|
||
|
displays the end points, the reading frame and strand.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=D54
|
||
|
Find open reading frames
|
||
|
? Minimum open frame in amino acids (5-1000) (30) =100
|
||
|
|
||
|
X 1 + strand only
|
||
|
2 - strand only
|
||
|
3 Both strands
|
||
|
? 0,1,2,3 =3
|
||
|
|
||
|
FT CDS 1 831 1 831
|
||
|
FT CDS 1540 2853 1 1314
|
||
|
FT CDS 3130 4242 1 1113
|
||
|
FT CDS 5761 6114 1 354
|
||
|
FT CDS 6187 6711 1 525
|
||
|
FT CDS 1766 2077 2 312
|
||
|
FT CDS 2078 2446 2 369
|
||
|
FT CDS 4136 5500 2 1365
|
||
|
FT CDS 1335 1637 3 303
|
||
|
FT CDS 2844 3194 3 351
|
||
|
FT CDS 6819 7238 3 420
|
||
|
FT CDS 2073 1711 C 1 363
|
||
|
FT CDS 2469 2149 C 1 321
|
||
|
FT CDS 6542 6144 C 3 399
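
The search can be sketched as below. This is not the program's code and rests on several assumptions: the standard stop codons are used, the stop codon itself is excluded from each reported stretch, frames on the complementary strand are numbered along the reverse complement, and results are returned as tuples rather than feature table lines. All names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_frames(seq, min_codons):
    """Yield (start, end, frame), 1-based inclusive, for stop-free stretches."""
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield (start + 1, i, frame + 1)
                start = i + 3
        # Stretch running from the last stop to the last complete codon.
        end = start + ((len(seq) - start) // 3) * 3
        if (end - start) // 3 >= min_codons:
            yield (start + 1, end, frame + 1)


def find_orfs(sequence, min_codons=30, strands="both"):
    """Return (start, end, strand, frame) tuples; strands is "+", "-" or "both"."""
    seq = sequence.upper().replace("U", "T")
    results = []
    if strands in ("+", "both"):
        results += [(s, e, "+", f) for s, e, f in open_frames(seq, min_codons)]
    if strands in ("-", "both"):
        n = len(seq)
        rc = seq.translate(COMPLEMENT)[::-1]
        # Map reverse-complement coordinates back to the original numbering,
        # so that start is greater than end on the complementary strand.
        results += [(n - s + 1, n - e + 1, "-", f)
                    for s, e, f in open_frames(rc, min_codons)]
    return results
```

As in the listing above, entries on the complementary strand have a start position greater than the end position.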
|
||
|
|
||
|
@55. TX 8 @ Search for E. coli promoter (general)
|
||
|
|
||
|
Searches for E. coli promoter-like sequences using a standard
|
||
|
weight matrix. The positions of the matches are plotted. No dialogue
|
||
|
is required.
|
||
|
|
||
|
The method was first described in Staden R. Nucl. Acid Res. 12
|
||
|
505-519 1984. This search uses a weight matrix taken from the
|
||
|
frequency tables contained in Hawley, D. K. and McClure, R., Nucl. Acid Res. 11
|
||
|
2237-2255 (1983). The weight matrix is divided into 3 sections that
|
||
|
are separated by varying sizes of gap: the -35 region, the -10 and
|
||
|
the +1 region. The algorithm first looks for a sufficiently good
|
||
|
-35 region, then for the best -10 region within range and then for
|
||
|
the best +1 region within range of the -10; each separate region
|
||
|
must score above the lowest known score for the corresponding
|
||
|
section. The gap penalty is then applied and two plots produced: one
|
||
|
with gap penalties, one without. Scaling is such that no known
|
||
|
promoter scores below the bottom level and no known promoter scores
|
||
|
above the top level when the weight matrix is applied.
|
||
|
|
||
|
Two other functions also look for E. coli promoters: 56 looks
for sites on the complementary strand and 57 looks for individual
-35 and -10 regions and plots them on a scale such that the top is the
highest known value +10% and the bottom is the lowest known -10%.
|
||
|
weights for E. coli promoters
|
||
|
-35 region:
|
||
|
P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
|
||
|
|
||
|
107109109110110110110110110111111110111112112112112112112112112112112112112
|
||
|
T 41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
|
||
|
C 22 27 18 29 20 14 20 12 22 23 16 25 10 43 7 6 11 18 60 8 25 23 23 17 20
|
||
|
A 28 38 30 37 35 56 42 42 37 42 39 18 25 26 2 6 2 72 26 50 26 34 25 26 31
|
||
|
G 16 11 29 19 21 18 13 21 9 19 24 26 29 29 11 6 88 3 11 17 15 21 26 21 27
|
||
|
-10 region:
|
||
|
P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
|
||
|
112112112112112112112112112112112112112112112112112112112
|
||
|
T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19108 31 29 21
|
||
|
C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30
|
||
|
A 20 39 33 33 39 23 29 16 23 19 2106 29 66 57 1 35 23 31
|
||
|
G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30
|
||
|
+1 region:
|
||
|
P -2 -1 1 2 3 4 5 6 7 8 9 10
|
||
|
86 88 85 88 88 88 88 88 88 88 88 88
|
||
|
T 16 22 2 42 27 23 20 25 27 15 16 29
|
||
|
C 29 49 4 25 25 13 18 22 17 17 16 17
|
||
|
A 20 9 45 16 24 25 28 24 24 32 35 26
|
||
|
G 21 8 37 5 12 27 22 17 20 24 21 16
|
||
|
Notes: E. coli promoters have been shown to contain 2 regions of
|
||
|
conserved sequence located about 10 and 35 bases upstream of the
|
||
|
transcription startsite. These are TATAAT and TTGACA with an allowed
|
||
|
spacing of 15 to 21 bases between. The spacing with maximum
|
||
|
efficiency was 17 bases and all but 12 of the 112 sequences could be
|
||
|
aligned with a separation of 17 +/- 1 bases. The standard promoter
|
||
|
has spacing 7 and 17 bases between the startsite and the -10 region,
|
||
|
and the -10 and -35 regions, respectively. The spacing between the
|
||
|
-10 region and the startsite is usually 6 or 7 bases but varies
|
||
|
between 4 and 8 bases. There is an AT rich region of 8 to 10 bases
|
||
|
upstream of the -35 region. Initiation with a purine is highly
preferred, with G being used if A is not present.
|
||
|
Gap penalties:
|
||
|
15 0.02 (only exists as mutant)
|
||
|
16 0.2
|
||
|
17 1.0
|
||
|
18 0.2
|
||
|
19 0.05 (guess)
|
||
|
20 0.02 (guess)
|
||
|
21 0.01 (guess)
|
||
|
@56. TX 8 @ Search for E. coli promoter (general) strand
|
||
|
|
||
|
This function searches for E. coli promoters on the
|
||
|
complementary strand of the sequence. See the notes on option 55.
|
||
|
@57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)
|
||
|
|
||
|
This function searches separately for the -35 and -10
|
||
|
sequences of an E. coli promoter. See the notes on option 55.
|
||
|
@58. TX 8 @ Search for procaryotic ribosome binding sites
|
||
|
|
||
|
This function searches for the 5' ends of prokaryotic genes
|
||
|
using an unusual weight matrix. The search is relatively slow
|
||
|
because the matrix is 101 bases in length. No dialogue is required.
|
||
|
|
||
|
The method was first described in Staden Nucl. Acid Res. 12
|
||
|
505-519 1984. This actually looks for more than a ribosome binding
|
||
|
site as is explained below. It uses the weight matrix w101 of
Stormo and Schneider (NAR 10 2971-3024, 1982) which, with a cutoff
score of 2, finds all gene starts in their library.
|
||
|
P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
|
||
|
T 5 1 -3 9-14 7 15 -5 3-16-17 4 18 5 -3 -1 2 4 5 -5 7 8 -5-15 6
|
||
|
C-21 -6-11-21 0 8 -7-12 -1 1 0-19 12 -3 -1 10 2 -8 -5-11 8 1 23 6 -5
|
||
|
A 7 -2 13 -2 -8-13-18 5 0 -5 13 8-15 9 -4 -7 9 0 -8-11-10 -6 -7 -5 -6
|
||
|
G -6 -9 -7 0 8-16 -4 -2-16 1 -4 8-14 5 11-13-24 3 7 22-11 -9-15 10 -4
|
||
|
|
||
|
P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
|
||
|
T 3 4 16 -4 7 11 -4 -1 12 8 10 -1 1 8 2-10-16 11 1 -3 16 -3-36 -8-27
|
||
|
C 2-14 -3 -8-10-21 2 0 -2 -1-11 -3 -1 5-11 -4 7 0-14 6 -8-20 -7-36-44
|
||
|
A-12 -1-27 -3 -6 0-12 -3 -4 -7 14 -2 -4 -6 0 12 5 -9 0-11-11 10 8 2 8
|
||
|
G 4 -5 -6 -3 -1 -4 -1 -4-15 0-14 3 10-19 -3-10 -7 -7 7 1 -8 -6 15 21 42
|
||
|
|
||
|
P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
|
||
|
T-53-27-26-23 2 -7-14-40-28 0-53 75-62-20-40-10-35 -5-12 -1 4 14-23 7 -2
|
||
|
C-15-50-43-35-38-29-29 1 -9 1-87-55-64-45 11-22-14-20-15-15-10-22 -5 2 6
|
||
|
A 0 -3 -5 4-20-11 5 6 -2-15 66-69-52 -5 -4 6 8-24 -7-10 -7 13 14 -9-18
|
||
|
G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15
|
||
|
|
||
|
P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
|
||
|
T-26 1 4 -7 3 -4 0-10 8-18 7-22-21 8 4 -3 -6 7 -8 1 -5-16-16 7 -6
|
||
|
C 6 -8 19 -7 9 -3 17 -2 3 -9 5 22 22 8 -1 1 18 6 11-10 -8 7 10 0 7
|
||
|
A 14-12-42 1 -5 -4-32 12-10 20 -6 -1 3 -4 4-10 -1 -2-14 11 14 -3 2-13 5
|
||
|
G-23 -7 -1 -6-17 -4 0-15-14 -4-17-10 -5-13 -8 10-13-13 9 -4 -3 10 2 4 -8
|
||
|
|
||
|
P 40
|
||
|
T 0
|
||
|
C 14
|
||
|
A 5
|
||
|
G-21
|
||
|
These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl.
|
||
|
Acid Res. 10 2997-3011, 1982. They report that this matrix gives a
|
||
|
score of at least 2 for all gene starts in their library whereas all
|
||
|
other sequences score 1 or less.
|
||
|
@29. TX 1 @ Reverse and complement the sequence
|
||
|
|
||
|
Reverses and complements the current active region of the
|
||
|
sequence.
|
||
|
@60. TX 7 @ Search using a dinucleotide weight matrix
|
||
|
|
||
|
This function performs searches for short sequence motifs
|
||
|
using an appropriate dinucleotide weight matrix. In addition it can
|
||
|
be used to create or modify weight matrices. In order to perform a
|
||
|
search the only input required is the name of the file containing
|
||
|
the weight matrix. The results can be presented graphically or
|
||
|
listed. The graphical presentation will draw a line at the position of
|
||
|
any matches found; the height of the line is proportional to the
|
||
|
score. The method is identical to that using weight matrices derived
|
||
|
from nucleotide frequencies, except that here we use the frequencies
|
||
|
of dinucleotides.
|
||
|
|
||
|
For a search, select "use weight matrix", supply the name of
|
||
|
the file containing the weight matrix, and choose between having
|
||
|
results plotted or listed. If dialogue is requested when the
|
||
|
function is selected users can alter the cutoff score employed.
|
||
|
|
||
|
To create a weight matrix several steps are involved. A file
|
||
|
containing an alignment of known motifs is required. (This file must
|
||
|
be created before the current option is selected. The format is as
|
||
|
follows: each sequence is written on a separate line with at least
|
||
|
one space at the beginning; each sequence is terminated by a space
|
||
|
character, and can be followed by a name. The sequences must be
|
||
|
aligned.) Supply the name of the file of aligned sequences. The
|
||
|
program reads and displays the sequences. Choose between "summing
|
||
|
logs of weights" or summing weights (i.e. whether to multiply or add
|
||
|
weights). If logs are used all scores will be negative. Choose if
|
||
|
all positions in the set of aligned sequences should be used or if a
|
||
|
mask should be applied. If so selected, define a mask as a string of
|
||
|
symbols, in which symbol - means ignore and any other symbol means
|
||
|
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
|
||
|
|
||
|
The program will calculate weights as the frequencies of the
|
||
|
dinucleotides at each unmasked position in the set of aligned
|
||
|
sequences. These weights are then applied to the set of aligned
|
||
|
sequences to give a range of "observed" scores. The mean and
|
||
|
standard deviation of these scores is displayed. The user is asked
|
||
|
to supply several values to be used when the weight matrix is
|
||
|
applied to other sequences: a cutoff score (by default, the mean
|
||
|
minus 3 standard deviations), a top score for scaling graphical
|
||
|
results (by default, the mean plus 3 standard deviations), and a
|
||
|
position to identify (this means that if a particular base within
|
||
|
the motif is used as a "landmark", such as the A of the AG in splice
|
||
|
acceptor sites, then its position will be marked in plots). All
|
||
|
these values are stored along with the weight matrix. Finally supply
|
||
|
the name of a file to contain the weight matrix.
|
||
|
|
||
|
Weight matrices can be "rescaled" using a set of aligned
|
||
|
sequences in much the same way as a matrix is created. The purpose
|
||
|
is to redefine the cutoff scores, and rescaling does not alter any
|
||
|
other values in the weight matrix file.
|
||
|
|
||
|
The methods have always had to deal with the problem of zeroes
|
||
|
in the matrices. The current versions employ "Laplace's Law of
|
||
|
Succession" in which 1 is added to each term.
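
The creation and use of a dinucleotide weight matrix can be sketched as follows, using the 16 aligned GCN4 sites from the typical dialogue for this option. The function names are hypothetical. The sketch assumes the "summing weights" mode, with each weight taken as the raw count of the dinucleotide at that position (no 1 added) and the standard deviation taken over the whole population; under those assumptions it gives the same scores, mean and standard deviation as the dialogue. The mask is written for the 18 dinucleotide positions of the motif.

```python
from collections import Counter
from math import sqrt

ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]
MASK = "----XXXXXXXX------"  # "-" means ignore, any other symbol means use


def make_matrix(aligned, mask):
    """Return one Counter of dinucleotide counts per position (None if masked)."""
    matrix = []
    for pos, symbol in enumerate(mask):
        if symbol == "-":
            matrix.append(None)
        else:
            matrix.append(Counter(seq[pos:pos + 2] for seq in aligned))
    return matrix


def score(matrix, segment):
    """Sum the weights of the dinucleotides found at the unmasked positions."""
    return sum(column[segment[pos:pos + 2]]
               for pos, column in enumerate(matrix) if column is not None)


def mean_and_sd(scores):
    """Mean and population standard deviation of the observed scores."""
    mean = sum(scores) / len(scores)
    sd = sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean, sd


matrix = make_matrix(ALIGNED, MASK)
observed = [score(matrix, seq) for seq in ALIGNED]
mean, sd = mean_and_sd(observed)
cutoff = mean - 3 * sd     # default cutoff score
top = mean + 3 * sd        # default top score for scaling plots
```

Here the first site scores 89, the mean is 88.75 and the default cutoff is the mean minus 3 standard deviations, about 66.79.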
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=D60
|
||
|
|
||
|
Motif search using dinucleotide weight matrix
|
||
|
X 1 Use weight matrix
|
||
|
2 Make weight matrix
|
||
|
3 Rescale weight matrix
|
||
|
? 0,1,2,3 = 2
|
||
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
||
|
|
||
|
|
||
|
1 AGCGTGACTCTTCCCGGAA HIS1
|
||
|
2 GAGGTGACTCACTTGGAAG HIS1
|
||
|
3 CGGATGACTCTTTTTTTTT HIS3
|
||
|
4 ACAGTGACTCACGTTTTTT HIS4
|
||
|
5 GTCGTGACTCATATGCTTT ARG3
|
||
|
6 TGAATGACTCACTTTTTGG ARG4
|
||
|
7 TTCTTGACTCGTCTTTTCT CPA1
|
||
|
8 CGAATGACTCTTATTGATG CPA2
|
||
|
9 AGAATGACTAATTTTACTA TRP5
|
||
|
10 TCGTTGACTCATTCTAATC TRP3
|
||
|
11 TTGCTGACTCATTACGATT TRP2
|
||
|
12 GAGATGACTCTTTTTCTTT IV1
|
||
|
13 GCGATGATTCATTTCTCTG IV2
|
||
|
14 TAGATGACTCAGTTTAGTC LEU1
|
||
|
15 TAAGTGACTCAGTTCTTTC LEU4
|
||
|
16 ATGATGACTCTTAAGCATG ILS1
|
||
|
Length of motif 18
|
||
|
? (y/n) (y) Sum logs of weights n
|
||
|
? (y/n) (y) Use all motif positions n
|
||
|
x means use, - means ignore
|
||
|
e.g. xx-x---x-x means use positions 1,2,4,8,10
|
||
|
? Mask=----XXXXXXXX--------
|
||
|
Applying weights to input sequences
|
||
|
1 89.000 AGCGTGACTCTTCCCGGA
|
||
|
2 91.000 GAGGTGACTCACTTGGAA
|
||
|
3 93.000 CGGATGACTCTTTTTTTT
|
||
|
4 90.000 ACAGTGACTCACGTTTTT
|
||
|
5 94.000 GTCGTGACTCATATGCTT
|
||
|
6 91.000 TGAATGACTCACTTTTTG
|
||
|
7 81.000 TTCTTGACTCGTCTTTTC
|
||
|
8 90.000 CGAATGACTCTTATTGAT
|
||
|
9 75.000 AGAATGACTAATTTTACT
|
||
|
10 97.000 TCGTTGACTCATTCTAAT
|
||
|
11 97.000 TTGCTGACTCATTACGAT
|
||
|
12 93.000 GAGATGACTCTTTTTCTT
|
||
|
13 69.000 GCGATGATTCATTTCTCT
|
||
|
14 90.000 TAGATGACTCAGTTTAGT
|
||
|
15 90.000 TAAGTGACTCAGTTCTTT
|
||
|
16 90.000 ATGATGACTCTTAAGCAT
|
||
|
Top score 97.000 Bottom score 69.000
|
||
|
Mean 88.750 Standard deviation 7.319
|
||
|
Mean minus 3.sd 66.794 Mean plus 3.sd 110.706
|
||
|
? Cutoff score (-999.00-9999.00) (66.79) =
|
||
|
? Top score for scaling plots (66.79-999.00) (110.71) =
|
||
|
? Position to identify (0-18) (1) =
|
||
|
? Title=GCN4 DI WTS
|
||
|
? Name for new weight matrix file=3.WTS
|
||
|
|
||
|
? Menu or option number=D60
|
||
|
Motif search using dinucleotide weight matrix
|
||
|
X 1 Use weight matrix
|
||
|
2 Make weight matrix
|
||
|
3 Rescale weight matrix
|
||
|
? 0,1,2,3 =
|
||
|
? Motif weight matrix file=3.WTS
|
||
|
GCN4 DI WTS
|
||
|
? Cutoff score (-9999.00-9999.00) (66.79) =40
|
||
|
? (y/n) (y) Plot results n
|
||
|
15 42.00 CAACCCGCTCACCGACAA
|
||
|
29 42.00 ACAACAGCTCACCCACGC
|
||
|
93 46.00 AGCCTTCCTCATCGCTGC
|
||
|
153 40.00 CAGCGGAATCAAACTTAA
|
||
|
408 42.00 CGATGGATTCAAGTTGAA
|
||
|
469 47.00 TTAGGAACTCCCTCTGTC
|
||
|
493 60.00 AAGCTGAATCTTAGCAGC
|
||
|
530 43.00 CGGAGGGCTCAGTGAGGG
|
||
|
542 47.00 TGAGGGACTACTGCACCA
|
||
|
678 41.00 CTTCTGCTTCAAAGAGTT
|
||
|
709 47.00 AATATGACGGCGCACGTG
|
||
|
848 54.00 GTCAGAACTCAAATCAGT
|
||
|
940 49.00 CCGTTGACGACCTCCGCA
|
||
|
992 42.00 TGGGCACCTCACACCAAG
|
||
|
|
||
|
|
||
|
@61. TX 8 @ Search for eukaryotic ribosome binding sites
|
||
|
|
||
|
Searches for eukaryotic ribosome binding sites using
|
||
|
weightings derived from Sargan, Gregory and Butterworth, FEBS Lett. 147
|
||
|
133-136 1982. No dialogue is required. First described in Staden
|
||
|
Nucl. Acid Res. 12 505-519 1984.
|
||
|
mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
|
||
|
147 133-136 1982
|
||
|
P -7 -6 -5 -4 -3 -2 -1 1 2 3
|
||
|
102102102102102102102102102102
|
||
|
T 19 24 31 12 0 18 5 0102 0
|
||
|
C 20 15 32 65 5 42 52 0 0 0
|
||
|
A 50 27 27 19 86 36 34102 0 0
|
||
|
G 6 29 12 6 11 6 11 0 0102
|
||
|
VIRAL ONLY
|
||
|
P -7 -6 -5 -4 -3 -2 -1 1 2 3
|
||
|
41 41 41 41 41 41 41 41 41 41
|
||
|
T 14 12 16 4 2 13 9 0 41 0
|
||
|
C 7 3 13 17 7 9 14 0 0 0
|
||
|
A 15 10 6 10 27 15 9 41 0 0
|
||
|
G 5 16 6 10 5 4 9 0 0 41
|
||
|
The Sargan et al paper puts forward the hypothesis that there is an
|
||
|
interaction between some mRNA leader sequences and a highly conserved
|
||
|
structure in the 18S rRNA of eukaryotic ribosomes. The attempt to
|
||
|
substantiate the hypothesis includes a table of base frequencies for
|
||
|
sequences immediately 5' to start codons. They examined 102
|
||
|
sequences and I have used the base frequencies they found as a weight
|
||
|
matrix for searching for eukaryotic gene starts. I don't yet know how
|
||
|
good this method is. The viral sequences were found to be slightly
|
||
|
different but the separate table shown here is not used in the
|
||
|
program.
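
Using a table of base frequencies as a weight matrix can be sketched as below, with the 102-sequence table shown above. How the program itself combines the frequencies is not stated here, so this sketch makes assumptions: 1 is added to every count, log frequencies are summed, and the reported position is that of the A of the ATG. The names are hypothetical.

```python
from math import log

# Base counts for positions -7..-1 and 1..3, copied from the table above.
COUNTS = {
    "T": [19, 24, 31, 12, 0, 18, 5, 0, 102, 0],
    "C": [20, 15, 32, 65, 5, 42, 52, 0, 0, 0],
    "A": [50, 27, 27, 19, 86, 36, 34, 102, 0, 0],
    "G": [6, 29, 12, 6, 11, 6, 11, 0, 0, 102],
}
N_SEQUENCES = 102


def log_weights(counts, n, pseudo=1):
    """Convert counts to log frequencies, adding a pseudocount to each term."""
    width = len(counts["A"])
    return [{base: log((counts[base][i] + pseudo) / (n + 4 * pseudo))
             for base in "TCAG"} for i in range(width)]


def scan(sequence, weights):
    """Return (position of the A of ATG, score) for every full window."""
    seq = sequence.upper().replace("U", "T")
    width = len(weights)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguity codes
        total = sum(w[base] for w, base in zip(weights, window))
        hits.append((i + 8, total))  # column 8 of the matrix is position 1
    return hits


WEIGHTS = log_weights(COUNTS, N_SEQUENCES)
```

In practice a cutoff score would be applied to the list returned by scan; the choice of cutoff is left to the user in this sketch.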
|
||
|
@62. TX 8 @ Search for splice junctions
|
||
|
|
||
|
Used to search for mRNA splice junctions using a weight
|
||
|
matrix. The default weight matrix is still that derived from the
|
||
|
paper of Mount (Nucl. Acids Res. 10, 459-472). However users may
|
||
|
employ their own tables. By default the positions of possible
|
||
|
junctions will be plotted rather than listed. The diagram splits
|
||
|
the donor plot into 3 horizontal boxes so that all the sites marked
|
||
|
in any box are from the same reading frame. The acceptor plot
|
||
|
appears above the donor plot and is split in an equivalent way. So
|
||
|
sites marked as donors and acceptors in equivalent boxes are
|
||
|
compatible. i.e. donors from donor box 1 are compatible with
|
||
|
acceptors from acceptor box 1, etc. Of course it is the combination
|
||
|
of reading frame and splice sites that really matters, and donors
|
||
|
from box 1 can be compatible with acceptors in box 3 if the reading
|
||
|
frame switches.
|
||
|
|
||
|
If dialogue is selected users can employ their own file of
|
||
|
weights (see below for the format), can change the cutoff scores,
|
||
|
and can elect to have the results listed rather than plotted. Listed
|
||
|
results show the position (of the last or first base in the exon),
|
||
|
the frame and the matching sequence. The frequency table shown
|
||
|
below is used as a default weight matrix and AG and GT are
|
||
|
obligatory at the appropriate positions. The plots are scaled so
|
||
|
that the top of scale is the highest value achieved by a junction
|
||
|
sequence in the set used to compile the frequency table, and the
|
||
|
bottom of the scale is the lowest value achieved by a junction
|
||
|
sequence in the set used to compile the frequency table.
|
||
|
|
||
|
In the light of current knowledge it would be sensible for
|
||
|
users to use the weight matrix search option (20) to create matrices
|
||
|
that define more specific splice junctions. If so it is important
|
||
|
that the positions "marked" are the last base in the donor exon and
|
||
|
the first base in the acceptor exon. To make a weight matrix
|
||
|
suitable for use with this function follow the instructions for
|
||
|
option 20 and create files for both donor and acceptor sites. Then
|
||
|
concatenate the two matrix files with the donor file first. Note
|
||
|
that any positions in the weight matrix that are 100% conserved will
|
||
|
be made obligatory (normally the AG and GT).
|
||
|
|
||
|
Mount donors redone 16-4-91
|
||
|
12 3 -16.085 -7.500
|
||
|
P -2 -1 0 1 2 3 4 5 6 7 8 9
|
||
|
N 136 136 136 136 136 136 136 136 136 136 136 136
|
||
|
T 28 8 15 17 0 136 9 16 7 84 30 36
|
||
|
C 41 60 16 7 0 0 3 13 3 17 28 39
|
||
|
A 40 56 89 12 0 0 83 91 12 23 53 33
|
||
|
G 27 12 16 100 136 0 41 16 114 12 25 28
|
||
|
Mount acceptors redone 16-4-91
|
||
|
18 15 -26.142 -14.400
|
||
|
P -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
|
||
|
N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
|
||
|
T 58 50 57 59 67 56 58 49 47 66 64 31 34 0 0 11 41 31
|
||
|
C 21 28 34 25 29 33 35 32 42 40 33 25 74 0 0 23 28 41
|
||
|
A 17 11 11 18 7 17 12 23 15 3 10 29 5 113 0 24 21 21
|
||
|
G 17 24 11 11 10 7 8 9 9 4 6 28 0 0 113 55 23 20
|
||
|
@63. TX 7 @ Search using a weight matrix (complementary)
|
||
|
|
||
|
This function searches the complementary strand of the
|
||
|
sequence using a weight matrix. Many motifs can bind to either
|
||
|
strand of the DNA and this function allows users to search the
|
||
|
complementary strand without having to change the orientation of the
|
||
|
sequence. See option 20 for more details.
|
||
|
@64. TX 3 @ Plot observed-expected word frequencies
|
||
|
|
||
|
This option is designed to examine the abundances of short
|
||
|
words in a sequence to see if particular ones are either under or
|
||
|
over represented. It compares the observed and expected frequencies
|
||
|
and plots them along the sequence. There has been some work on the
|
||
|
relative amounts of CG dinucleotides in eukaryotic sequences (eg
|
||
|
Bird, Nature 321, 209-213 (1986)) and this new routine can be used
|
||
|
to examine such biases, or any others that might be interesting.
|
||
|
|
||
|
The user selects a word - say CG -, a window length, and a
|
||
|
maximum and minimum scale for plotting the results. The program
|
||
|
examines each successive window length along the sequence, with each
|
||
|
window overlapping the previous one by windowlength-1. The program
|
||
|
counts the base frequencies in each window, and the number of
|
||
|
occurrences of the chosen word within the window. Using the base
|
||
|
frequencies it calculates an expected number of occurrences for the
|
||
|
chosen word (simply by multiplying the relevant frequencies). It
|
||
|
plots observed-expected, and hence will show regions that are rich
|
||
|
or depleted in the chosen word. The longest allowed word is 9
|
||
|
characters, but the calculation of the expected frequencies becomes
|
||
|
less appropriate as the word length increases above 2.
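
The calculation can be sketched as follows. The function name and the returned (position, value) pairs are hypothetical. The expected count is taken as the window length multiplied by the frequencies of the bases in the word; this is an assumption, chosen because for a span of 101 and equal base frequencies it gives 101/16 = 6.31, the default plot limit offered for the word CG.

```python
def observed_minus_expected(sequence, word="CG", span=101, interval=5):
    """Return (window centre, observed - expected) along the sequence."""
    seq = sequence.upper()
    word = word.upper()
    half = span // 2  # span is assumed to be odd
    results = []
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1]
        # Count (possibly overlapping) occurrences of the word in the window.
        observed = sum(1 for i in range(len(window) - len(word) + 1)
                       if window.startswith(word, i))
        # Expected count from the base composition of this window alone.
        expected = float(len(window))
        for base in word:
            expected *= window.count(base) / len(window)
        results.append((centre + 1, observed - expected))
    return results
```

Positive values mark windows that are rich in the chosen word and negative values mark windows that are depleted.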
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
|
||
|
? Menu or option number=D64
|
||
|
Plot composition differences (obs-exp))
|
||
|
Default String=CG
|
||
|
? String=
|
||
|
? odd span length (3-401) (101) =
|
||
|
? plot interval (1-20) (5) =
|
||
|
? Maximum plot value (-6.31-25.25) (6.31) =
|
||
|
? Minimum plot value (-25.25-6.31) (-6.31) =
|
||
|
|
||
|
Missing graphics display here
|
||
|
|
||
|
@65. TX 9 @ Search for polya sites
|
||
|
|
||
|
Simply searches for the sequence AATAAA (Proudfoot and
|
||
|
Brownlee, Nature 263, 211-214, 1976) and marks it with a short
|
||
|
vertical line.
|
||
|
@66. TX 1 @ Interconvert t and u
|
||
|
|
||
|
This function interconverts T and U characters in the active
|
||
|
sequence, i.e. between DNA and RNA.
|
||
|
@67. TX 7 @ Search for patterns of motifs
|
||
|
|
||
|
This option searches for patterns of motifs. Patterns can be
|
||
|
defined interactively or read from files. Results can be displayed
|
||
|
in several ways in both graphical and textual form. Used to create
|
||
|
pattern files for searching libraries. The option is extremely
|
||
|
flexible and consequently the following documentation is quite
|
||
|
lengthy. However the routine is capable of searching for almost any
|
||
|
known pattern. In addition the flexibility does not necessitate
|
||
|
difficulty of use, and the user interface has been simplified
|
||
|
considerably since the methods were first published.
|
||
|
|
||
|
Users should refer to the "typical dialogue" shown below for
|
||
|
the most helpful information on using the program.
|
||
|
|
||
|
There are currently four ways to display the matching
|
||
|
patterns: 1=each individual motif and its position is listed; 2=all
|
||
|
the sequence between, and including, the two outermost motifs is
|
||
|
listed; 3=graphical, with a vertical line marking the position of
|
||
|
the leftmost motif; 4 = EMBL feature table format, where the KEYNAM
|
||
|
field is the motif name, the FROM and TO fields denote the ends of
|
||
|
the match, and the DESCRIPTION field is "Program".
|
||
|
|
||
|
When it is defined for the first time a pattern must be
|
||
|
entered interactively at the keyboard, but the pattern description
|
||
|
can be saved to a file. This file can be used for all subsequent
|
||
|
searches.
|
||
|
|
||
|
When defining a pattern interactively select a motif class and
|
||
|
the program will request the required inputs.
|
||
|
|
||
|
The program gives each motif an identifying name and number.
|
||
|
For motifs other than the first, a range of allowed positions must
|
||
|
be defined (Note that sets of motifs included using the OR operator
|
||
|
will all be given the same range, and so the program will only
|
||
|
request range values for the first motif in any such set). To
|
||
|
specify the allowed range for a motif the user must supply the
|
||
|
following: the identifying number of the motif, relative to which
|
||
|
the current motif's positions are to be defined (termed the
|
||
|
"reference motif"); a "relative start position" and a range. The
|
||
|
relative start position can be negative or positive. A negative
|
||
|
start position means that although the reference motif is searched
|
||
|
for first, the current motif can be found to its left. A zero
|
||
|
relative start position means their left ends are superimposed. The
|
||
|
default start position is to butt-join the motif to the right-hand end
|
||
|
of the "reference motif". The range is "the number of extra
|
||
|
positions" that the motif can take.
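
     As an illustration of how the relative start position and the
number of extra positions combine, the allowed left-end positions of a
motif can be computed as below. The function and argument names are
hypothetical; positions are measured from the left (5 prime) end of
the reference motif, as in the pattern description printed by the
program.

```python
def allowed_positions(ref_left, relative_start, extra_positions):
    """Positions that the left end of the current motif may occupy.

    ref_left is the position of the left end of the reference motif.
    A relative start of 0 superimposes the two left ends; a negative
    value places the current motif to the left of the reference motif.
    """
    first = ref_left + relative_start
    return range(first, first + extra_positions + 1)
```

For example, a relative start position of 4 with 2 extra positions
gives "positions 4 to 6" relative to the reference motif.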
|
||
|
|
||
|
The program will display the probability of finding each
|
||
|
motif. These values are presented in the following form: .1234E-5
|
||
|
means 0.1234 times 10 to the power -5.
|
||
|
|
||
|
After the pattern has been defined, the program will type a
|
||
|
description of it on the screen. It will then allow the user to give
|
||
|
an overall cutoff score and overall probability cutoff.
|
||
|
|
||
|
Typical dialogue for all the different motif classes is
|
||
|
displayed below.
|
||
|
|
||
|
? Menu or option number=67
|
||
|
Pattern searcher
|
||
|
? (y/n) (y) Read pattern from keyboard
|
||
|
X 1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =
|
||
|
? Motif name=Ematch
|
||
|
? String=AA
|
||
|
Probability of score 2.0000 = 0.595E-01
|
||
|
X 1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =2
|
||
|
? Motif name=AAA
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-1) (1) =
|
||
|
? Relative start position (-1000-1000) (3) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? string=AAA
|
||
|
? Minimum matches (1.00-3.00) (3.00) =2
|
||
|
Probability of score 2.0000 = 0.149E+00
|
||
|
1 Exact match
|
||
|
X 2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =3
|
||
|
? Motif name=T'S
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-2) (2) =
|
||
|
? Relative start position (-1000-1000) (4) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? String=TTT
|
||
|
? Minimum score (0.00-108.00) (108.00) =72
|
||
|
Probability of score 72.0000 = 0.258E+00
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
X 3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =4
|
||
|
? Motif name=GCN4
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-3) (3) =
|
||
|
? Relative start position (-1000-1000) (4) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? Weight matrix file name=GCN4
|
||
|
GCN4 FROM WEIGHTS 17-11-87
|
||
|
Probability of score -22.0020 = 0.139E-02
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
X 4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =5
|
||
|
? Motif name=GCN4
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-4) (4) =
|
||
|
? Relative start position (-1000-1000) (20) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? Weight matrix file name=GCN4
|
||
|
GCN4 FROM WEIGHTS 17-11-87
|
||
|
Probability of score -22.0020 = 0.606E-03
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
X 5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =6
|
||
|
? Motif name=LOOP
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-5) (5) =
|
||
|
? Relative start position (-1000-1000) (20) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? Stem length (1-60) (6) =
|
||
|
? Minimum loop length (-6-60) (0) =
|
||
|
? Maximum loop length (0-60) (0) =5
|
||
|
? Minimum score (1.00-12.00) (12.00) =10
|
||
|
Probability of score 10.0000 = 0.598E-02
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
X 6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =7
|
||
|
? Motif name=Tstep
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-6) (6) =
|
||
|
? (y/n) (y) Relative to 5 prime end
|
||
|
? Relative start position (-1000-1000) (1) =
|
||
|
? Number of extra positions (0-1000) (0) =
|
||
|
? String=TTT
|
||
|
? Step (1-20) (3) =
|
||
|
Probability of score 3.0000 = 0.367E-01
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
X 7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =8
|
||
|
? Motif name=REPEAT
|
||
|
X 1 And
|
||
|
2 Or
|
||
|
3 Not
|
||
|
? 0,1,2,3 =
|
||
|
? Number of reference motif (1-7) (7) =
|
||
|
? Relative start position (-1000-1000) (4) =
|
||
|
? Number of extra positions (0-1000) (0) =2
|
||
|
? Repeat length (1-60) (6) =
|
||
|
? Minimum gap (0-60) (0) =
|
||
|
? Maximum gap (0-60) (0) =4
|
||
|
? Minimum score (1.00-6.00) (6.00) =5
|
||
|
Probability of score 5.0000 = 0.554E-02
|
||
|
1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
X 8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =9
|
||
|
? (y/n) (y) Save pattern in a file N
|
||
|
|
||
|
Pattern description
|
||
|
|
||
|
Motif 1 named Ematch is of class 1
|
||
|
Which is an exact match to the string
|
||
|
AA
|
||
|
Motif 2 named AAA is of class 2
|
||
|
which is a match of score 2. to the string
|
||
|
AAA
|
||
|
and the 5 prime base can take positions 3 to 3
|
||
|
relative to the 5 prime end of motif 1
|
||
|
It is anded with the previous motif.
|
||
|
Motif 3 named T'S is of class 3
|
||
|
which is a match of score 72. to the string
|
||
|
TTT
|
||
|
and the 5 prime base can take positions 4 to 4
|
||
|
relative to the 5 prime end of motif 2
|
||
|
It is anded with the previous motif.
|
||
|
Motif 4 named GCN4 is of class 4
|
||
|
Which is a match to a weight matrix with score -22.002
|
||
|
and the 5 prime base can take positions 4 to 4
|
||
|
relative to the 5 prime end of motif 3
|
||
|
It is anded with the previous motif.
|
||
|
Motif 5 named GCN4 is of class 5
|
||
|
Which is a match to the complement of a weight matrix with score -22.002
|
||
|
and the 5 prime base can take positions 20 to 20
|
||
|
relative to the 5 prime end of motif 4
|
||
|
It is anded with the previous motif.
|
||
|
Motif 6 named LOOP is of class 6
|
||
|
Which is a stem-loop structure with stem length 6 and score 10.
|
||
|
The loop can have sizes 0 to 5
|
||
|
and the 5 prime base can take positions 20 to 20
|
||
|
relative to the 5 prime end of motif 5
|
||
|
It is anded with the previous motif.
|
||
|
Motif 7 named Tstep is of class 7
|
||
|
Which is an exact match to the string
|
||
|
TTT
|
||
|
with a step size of 3
|
||
|
and the 5 prime base can take positions 1 to 1
|
||
|
relative to the 5 prime end of motif 6
|
||
|
It is anded with the previous motif.
|
||
|
Motif 8 named REPEAT is of class 8
|
||
|
Which is a repeat with repeat length 6 and score 5.
|
||
|
The loop-out can have sizes 0 to 4
|
||
|
and the 5 prime base can take positions 4 to 6
|
||
|
relative to the 5 prime end of motif 7
|
||
|
It is anded with the previous motif.
|
||
|
Probability of finding pattern = 0.2348E-14
|
||
|
Expected number of matches = 0.5100E-09
|
||
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
||
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
||
|
Select display mode
|
||
|
X 1 Motif by motif
|
||
|
2 Inclusive
|
||
|
3 Graphical
|
||
|
4 EMBL feature table
|
||
|
? 0,1,2,3,4 =4
|
||
|
Searching
|
||
|
|
||
|
|
||
|
Total matches found 0
|
||
|
|
||
|
Menus and their numbers are
|
||
|
m0 = This menu
|
||
|
m1 = General
|
||
|
m2 = Screen control
|
||
|
m3 = Statistical analysis of content
|
||
|
m4 = Structures and repeats
|
||
|
m5 = Translation and codons
|
||
|
m6 = Gene search by content
|
||
|
m7 = Prokaryotic signal search
|
||
|
m8 = Eukaryotic signal search
|
||
|
? = Help
|
||
|
! = Quit
|
||
|
? Menu or option number=67
|
||
|
Pattern searcher
|
||
|
? (y/n) (y) Read pattern from keyboard
|
||
|
X 1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =
|
||
|
? Motif name=Arun
|
||
|
? String=AAAAAA
|
||
|
Probability of score 6.0000 = 0.210E-03
|
||
|
X 1 Exact match
|
||
|
2 Percentage match
|
||
|
3 Cut-off score and score matrix
|
||
|
4 Cut-off score and weight matrix
|
||
|
5 Complement of weight matrix
|
||
|
6 Inverted repeat or stem-loop
|
||
|
7 Exact match, defined step
|
||
|
8 Direct repeat
|
||
|
9 Pattern complete
|
||
|
? 0,1,2,3,4,5,6,7,8,9 =9
|
||
|
? (y/n) (y) Save pattern in a file N
|
||
|
|
||
|
Pattern description
|
||
|
|
||
|
Motif 1 named Arun is of class 1
|
||
|
Which is an exact match to the string
|
||
|
AAAAAA
|
||
|
Probability of finding pattern = 0.2103E-03
|
||
|
Expected number of matches = 0.1522E+01
|
||
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
||
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
||
|
Select display mode
|
||
|
X 1 Motif by motif
|
||
|
2 Inclusive
|
||
|
3 Graphical
|
||
|
4 EMBL feature table
|
||
|
? 0,1,2,3,4 =4
|
||
|
Searching
|
||
|
|
||
|
|
||
|
FT Arun 1582 1587 Program
|
||
|
FT Arun 3160 3165 Program
|
||
|
FT Arun 4204 4209 Program
|
||
|
FT Arun 5691 5696 Program
|
||
|
FT Arun 6710 6715 Program
|
||
|
Total matches found 5
|
||
|
Minimum and maximum observed scores 6.00 6.00
|
||
|
|
||
|
|
||
|
These methods allow users to define and search for complex
|
||
|
patterns of motifs defined as single objects. The programs allow
|
||
|
individual DNA motifs to be defined in eight different ways, and
|
||
|
protein motifs in six. Motifs are combined, using the logical
|
||
|
operators AND, OR and NOT, to describe a pattern. The pattern also
|
||
|
specifies the ranges of allowed relative separations of the
|
||
|
individual motifs.
|
||
|
|
||
|
First some definitions.
|
||
|
|
||
|
A MOTIF is a contiguous subsequence of fixed length. At its
|
||
|
simplest it could be a single definite base or amino acid; a more
|
||
|
complex motif might be better represented as a consensus or a weight
|
||
|
matrix; two more-abstract types of motif are direct and inverted
|
||
|
repeats.
|
||
|
|
||
|
A PATTERN is a higher order of structure defined by a list of
|
||
|
motifs. The motifs in a pattern are combined using the logical
|
||
|
operators AND, OR and NOT. The list also defines the allowed
|
||
|
relative separations of the motifs. In the current versions of the
|
||
|
programs up to 50 motifs can be combined into a single pattern. So
|
||
|
using these definitions there are two differences between motifs and
|
||
|
patterns: 1) the distances between all elements of a motif are
|
||
|
fixed, but the separations of parts of patterns can vary; 2) all
|
||
|
characters in a motif are defined using the same method (class), but
|
||
|
different parts of a pattern can be defined in completely different
|
||
|
ways.
|
||
|
|
||
|
Each motif can be represented in 9 ways (known as the motif
|
||
|
class):
|
||
|
|
||
|
MOTIF CLASSES
|
||
|
CLASS DESCRIPTION
|
||
|
1 Exact match to a short defined sequence. The IUB symbols
|
||
|
can be used for DNA sequences.
|
||
|
2 Percentage match to a defined short sequence. In nucleic acids,
|
||
|
the IUB symbols can be used.
|
||
|
3 Match to a defined sequence, using a score matrix and cutoff
|
||
|
score. The DNA matrix (see option 18) gives scores to IUB symbols
|
||
|
depending on their level of redundancy. MDM78 is used for proteins.
|
||
|
4 Match to a weight matrix with cutoff score.
|
||
|
5 As class 4 but on the complementary strand.
|
||
|
6 Inverted repeat or stem-loop. Fixed stem length, range of
|
||
|
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
|
||
|
7 Exact match to short sequence but with a defined step size.
|
||
|
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
|
||
|
cutoff score, and score matrix (for protein sequences MDM78 and
|
||
|
for nucleic acids an identity matrix).
|
||
|
9 Membership of a set. A list of sets of allowed amino acids for
|
||
|
each position in the motif. The sets are separated by commas(,).
|
||
|
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
|
||
|
acids in which one of I,V or L must be found in the first position,
|
||
|
then anything in the next two positions, D,E,K or R in the fourth
|
||
|
position and F,Y,W,I,L,V or M in the fifth. This class only applies
|
||
|
to protein sequences because for nucleic acids "membership of a
|
||
|
 set" can be achieved using IUB symbols.
|
||
|
|
||
|
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
|
||
|
nucleic acids.
|
||
|
|
||
|
|
||
|
Class 1: exact match.
|
||
|
|
||
|
The motif is defined by a short sequence, which for nucleic
|
||
|
acids, may include IUB symbols. All symbols must match.
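
     A sketch of an exact match that honours IUB symbols in the motif
is given here. The table holds the standard IUB/IUPAC nucleotide
codes; the function name is hypothetical, and the sketch assumes the
sequence itself contains only A, C, G and T.

```python
# Standard IUB/IUPAC nucleotide codes and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def exact_matches(seq, motif):
    """Return 1-based positions where every motif symbol matches."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        if all(seq[i + k] in IUB[motif[k]] for k in range(len(motif))):
            hits.append(i + 1)
    return hits
```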
|
||
|
|
||
|
Class 2: percentage match
|
||
|
|
||
|
The motif is defined by a short sequence, which for nucleic
|
||
|
acids, may include IUB symbols. The minimum number of matching
|
||
|
characters must also be specified.
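
     The counting involved can be sketched as follows. The function
name is hypothetical and, to keep the example short, IUB symbols are
not expanded: characters are compared for simple identity.

```python
def percentage_matches(seq, motif, min_matches):
    """Return (1-based position, matches) for windows reaching min_matches."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        # Count the aligned characters that are identical.
        n = sum(a == b for a, b in zip(window, motif))
        if n >= min_matches:
            hits.append((i + 1, n))
    return hits
```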
|
||
|
|
||
|
Class 3: match using a score matrix
|
||
|
|
||
|
The motif is defined by a short sequence, which for nucleic
|
||
|
acids, may include IUB symbols. The motif is not compared directly
|
||
|
with the sequence to count the number of matching characters.
|
||
|
Instead a matrix is used to provide a score for all possible pairs
|
||
|
of characters. The motif score for any position along the sequence
|
||
|
is the sum of the scores found by looking-up the scores for each
|
||
|
pair of aligned characters. A match is declared if some minimum
|
||
|
score is achieved.
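
     The scoring can be sketched as below. The matrix used here is a
placeholder, not the program's DNA matrix (see option 18): a value of
36 for identical bases is assumed only because it is consistent with
the maximum of 108 shown for TTT in the dialogue above, and all
mismatches are assumed to score 0. The function names are
hypothetical.

```python
def make_toy_matrix(match=36, mismatch=0):
    """Placeholder score matrix over A, C, G, T (values are assumptions)."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in "ACGT" for b in "ACGT"
    }

def score_matrix_matches(seq, motif, matrix, cutoff):
    """Return (1-based position, score) where the summed score >= cutoff."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        # Look up a score for each aligned pair and add them together.
        score = sum(matrix[(m, s)] for m, s in zip(motif, window))
        if score >= cutoff:
            hits.append((i + 1, score))
    return hits
```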
|
||
|
|
||
|
Class 4: weight matrix
|
||
|
|
||
|
The motif is defined by a table of values (called weights or
|
||
|
scores). The table gives a score for finding each possible character
|
||
|
at each position along the length of the motif. It therefore has
|
||
|
dimension motif-length x character-set-size, and allows us to give
|
||
|
different scores for each character at each position. It is
|
||
|
equivalent to having a different score matrix for each position
|
||
|
along the motif, and provides the most flexible and specific method
|
||
|
of defining motifs. The weight matrices are created by program NIP
|
||
|
option 20 and stored as files. The file contains the values for each
|
||
|
position, as well as an overall minimum score. There are two ways in
|
||
|
which these values can be used to calculate an overall score for any
|
||
|
section of the sequence. The simplest way is to add the values in
|
||
|
the file. (This means that the highest possible score can be
|
||
|
calculated by adding the top value at each column position, and the
|
||
|
lowest by adding the bottom value.) The normal way of using the
|
||
|
values in the file is as follows. First the programs divide the
|
||
|
values in each column by the column total so that they sum to 1.0.
|
||
|
Then the natural logs of these values are used as scores. When the
|
||
|
matrix is applied to a sequence these logarithmic values are summed
|
||
|
(which is of course equivalent to multiplying the frequencies).
|
||
|
Note that using the natural logs of the frequencies as weights and
|
||
|
adding them means that the overall cutoff score must be less than
|
||
|
zero, whereas if the original values in the weight matrix file are
|
||
|
added, the cutoff score will be greater than zero. The search
|
||
|
routines therefore decide whether the user wants to add values or
|
||
|
multiply frequencies by examining the value of the cutoff score: it
|
||
|
will add the raw values if the cutoff is greater than zero and add logs of
|
||
|
frequencies if it is less than zero. Hence we effectively get two
|
||
|
motif classes in one. The program NIP, when creating weight matrix
|
||
|
files, will ask the user whether the scores should be added or
|
||
|
multiplied. If the values in the table have been defined without
|
||
|
using a set of aligned sequences it is easier for the user to choose
|
||
|
a cutoff score if the values are added.
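
     The two ways of using a weight matrix can be sketched as one
function that chooses its mode from the sign of the cutoff. This is an
illustration only: the function name and the data layout (one
dictionary per motif position) are assumptions, as is returning minus
infinity when a character has frequency zero, which the text does not
discuss. The text also does not say what happens for a cutoff of
exactly zero; here it is treated as the logarithmic mode.

```python
import math

def weight_matrix_score(window, columns, cutoff):
    """Score one window against a weight matrix.

    columns[k] maps each character to its value at motif position k.
    """
    if cutoff > 0:
        # Additive mode: sum the raw values from the file.
        return sum(col[ch] for col, ch in zip(columns, window))
    total = 0.0
    for col, ch in zip(columns, window):
        # Normalise the column so that it sums to 1.0, then add the logs.
        freq = col[ch] / sum(col.values())
        if freq == 0:
            return -math.inf  # assumption: an unseen character rules out a match
        total += math.log(freq)
    return total
```

A window is then reported as a match when its score is at least the
cutoff.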
|
||
|
|
||
|
Class 5: complement of weight matrix
|
||
|
|
||
|
The motif is defined by a weight matrix, but the program
|
||
|
searches for its complement.
|
||
|
|
||
|
Class 6: inverted repeat, or stem-loop
|
||
|
|
||
|
The motif is defined by a repeat length, a minimum score and a
|
||
|
range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
|
||
|
The loop sizes are defined by a minimum and maximum distance from
|
||
|
the 3' end of the stem. For a stem-loop these will be positive
|
||
|
numbers. For example to define a stem of length 8 and loop sizes
|
||
|
varying from 3 to 5, the stem would be set to 8, the minimum start
|
||
|
distance to 3 and the maximum to 5. To define an inverted repeat the
|
||
|
minimum distance will be negative. For example stem length=9,
|
||
|
minimum distance=-9, and maximum distance=-8 will find inverted
|
||
|
repeats of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be
|
||
|
found, the first having a base at its centre, the second having
|
||
|
none.
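
     One reading of this definition is sketched below: the second arm
of the stem begins "loop" bases after the 3' end of the first arm, and
base k of the first arm pairs with base k counted back from the end of
the second arm. The function name and the 0-based start argument are
hypothetical. Under this reading the two inverted repeat examples
above behave as described.

```python
# Pair scores from the text: A-T=2, G-C=2, G-T=1, anything else 0.
PAIR_SCORE = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_loop_score(seq, start, stem_len, loop):
    """Score a stem whose first arm begins at 0-based index start.

    loop is the distance from the 3' end of the first arm to the start
    of the second arm; it is negative when the arms overlap, which is
    how an inverted repeat is expressed.
    """
    second_arm_end = start + 2 * stem_len + loop - 1
    score = 0
    for k in range(stem_len):
        pair = (seq[start + k], seq[second_arm_end - k])
        score += PAIR_SCORE.get(pair, 0)
    return score
```

With stem length 9, AAAAATTTT at distance -9 scores 16 (the central
base pairs with itself and scores 0), while AAAAATTTTT at distance -8
reaches the maximum of 18.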
|
||
|
|
||
|
Class 7: exact match, defined step size.
|
||
|
|
||
|
The motif is defined by a short sequence, which for nucleic
|
||
|
acids, may include IUB symbols. All symbols must match. The class
|
||
|
differs from class 1 in that searches will move in steps of some
|
||
|
given size. For example we could search for a certain codon and use
|
||
|
a step size of 3 and hence keep in a single reading frame.
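
     A sketch of a stepped search follows. The function name and the
0-based start argument are hypothetical; the point is only that the
comparison is made at every step-th position, so a step of 3 stays in
one reading frame.

```python
def stepped_matches(seq, motif, start, step):
    """Return 1-based positions of exact matches, testing every step-th base.

    start is the 0-based index of the first position examined.
    """
    hits = []
    for i in range(start, len(seq) - len(motif) + 1, step):
        if seq[i:i + len(motif)] == motif:
            hits.append(i + 1)
    return hits
```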
|
||
|
|
||
|
Class 8: direct repeat
|
||
|
|
||
|
The motif is defined by a repeat length, a minimum score and a
|
||
|
range of loop sizes. The scores are defined using MDM78 for protein
|
||
|
sequences and an identity matrix for nucleic acids. The loop sizes
|
||
|
are defined by a minimum and maximum distance from the 3' end of the
|
||
|
stem.
|
||
|
|
||
|
Class 9: membership of a set
|
||
|
|
||
|
This motif class is for protein sequences. It is defined by
|
||
|
lists of allowed amino acids for each position in the motif, and a
|
||
|
cut-off score. Positions at which any amino acid can occur are left
|
||
|
blank. All allowed amino acids for each position give a score of 1.
|
||
|
The motifs can be defined in two ways: either typed at the keyboard
|
||
|
or read in as a weight-matrix-like file. When the motif is defined
|
||
|
at the keyboard the sets of allowed amino acids are separated by
|
||
|
commas(,). For example IVL,,,DEKR,FYWILVM defines a motif of length
|
||
|
5 amino acids in which one of I,V or L must be found in the first
|
||
|
position, then anything in the next two positions, D,E,K or R in the
|
||
|
fourth position and F,Y,W,I,L,V or M in the fifth. To specify that
|
||
|
the whole motif must match a score of 3 would be required (i.e. one
|
||
|
of the allowed amino acids must be found for each of the three
|
||
|
defined positions). If the motif is read from a file the file must
|
||
|
have been written by program NIP, or have been saved by the pattern
|
||
|
searching routines. If the user elects to save a pattern, and it
|
||
|
includes class 9 motifs typed at the keyboard, then the program will
|
||
|
save the class 9 motifs as weight matrix files. Therefore it will
|
||
|
request file names for each motif of this class. If the motif given
|
||
|
above as an example were saved the weight matrix file would have 5
|
||
|
columns. The first column would contain zeroes except for the I, V
|
||
|
and L rows which would be set to 1; the next two columns would all
|
||
|
be zero; the next would be zero except for the D,E,K and R rows
|
||
|
which would be 1; the final column would contain 1's in rows
|
||
|
F,Y,W,I,L,V and M, with the rest zero.
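
     The example above can be worked through in a short sketch. The
function names are hypothetical, and the row order used for the
matrix columns is an arbitrary choice, not the layout of the
program's files.

```python
def parse_sets(text):
    """Split a keyboard definition such as IVL,,,DEKR,FYWILVM into sets."""
    return [set(field) for field in text.split(",")]

def set_score(window, sets):
    """One point for each defined position holding an allowed amino acid."""
    return sum(
        1 for ch, allowed in zip(window, sets) if allowed and ch in allowed
    )

def to_matrix_columns(sets, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Weight-matrix-like columns: 1 for allowed amino acids, otherwise 0."""
    return [{aa: int(aa in allowed) for aa in alphabet} for allowed in sets]
```

Blank positions contribute nothing to the score, which is why a score
of 3 means that all three defined positions matched.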
|
||
|
|
||
|
The logical operator (AND, OR or NOT) used to add each motif
|
||
|
to the pattern is specified by preceding the class number by the
|
||
|
letters A, O or N. A = AND, O = OR, N = NOT. The default is A, so
|
||
|
N2 means include, using the NOT operator, a class 2 motif; O2 means
|
||
|
include, using the OR operator, a class 2 motif; both A2 and 2 mean
|
||
|
include, using the AND operator, a class 2 motif.
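
     The prefix convention can be illustrated with a small parser.
This is a sketch with a hypothetical function name, not the program's
input routine.

```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}

def parse_motif_choice(text):
    """Split an entry such as N2, O2, A2 or 2 into (operator, class)."""
    text = text.strip().upper()
    if text and text[0] in OPERATORS:
        return OPERATORS[text[0]], int(text[1:])
    # No letter given: AND is the default operator.
    return "AND", int(text)
```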
|
||
|
|
||
|
Range setting.
|
||
|
|
||
|
The motifs in a pattern are numbered according to their order
|
||
|
in the list. Apart from the first motif in a pattern all motifs are
|
||
|
given a range of allowed positions relative to a motif further up
|
||
|
the list. For example suppose we have a pattern defined by A AND B
|
||
|
AND C AND D. Motif A can occur anywhere, but B must have its range
|
||
|
of allowed positions defined relative to the position of motif A,
|
||
|
and C's positions can be defined relative to either A or B,
|
||
|
depending on which is most convenient, and likewise D's positions
|
||
|
can be relative to A or B or C.
|
||
|
|
||
|
Notice that the positions of motifs can be defined relative to
|
||
|
more than one motif. Suppose we have a pattern consisting of motifs
|
||
|
A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-
|
||
|
10 residues right of B, and also C is never more than 15 residues
|
||
|
from A. Then it is quite consistent with the methods to include
|
||
|
motif C into the pattern twice using the AND operator: once relative
|
||
|
to A and once relative to B. This will define the relative spacing
|
||
|
and the ORDER of the motifs in the pattern. (If we simply defined
|
||
|
the position of C relative to A it could be found to the left of B).
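
     The effect of including C twice can be sketched as a test on
candidate positions. The function name is hypothetical and a, b and c
stand for the left-end positions of the three motifs; the sketch
reads "never more than 15 residues from A" as a range of 0 to 15 to
the right of A.

```python
def pattern_ok(a, b, c):
    """Check the A, B, C example: both ranges for C must hold."""
    b_after_a = 5 <= b - a <= 10      # B is 5-10 residues right of A
    c_after_b = 5 <= c - b <= 10      # C is 5-10 residues right of B
    c_near_a = 0 <= c - a <= 15       # C is at most 15 residues from A
    return b_after_a and c_after_b and c_near_a
```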
|
||
|
|
||
|
Motifs combined together using the OR operator are all given
|
||
|
the same range. For example suppose we had a pattern A AND (B OR C)
|
||
|
AND (D OR E), then B and C each have the same range, and D and E
|
||
|
also have the same range as one another. The range for D and E can
|
||
|
be relative to A or to B.
|
||
|
|
||
|
Motifs cannot have their ranges defined relative to motifs
|
||
|
that are included using the NOT operator. For example if we had the
|
||
|
pattern A NOT B AND C, then the range for C can only be defined
|
||
|
relative to motif A.
|
||
|
|
||
|
Speed can be gained by arranging the order of the motifs so
|
||
|
that those higher up the list are of types that can be searched for
|
||
|
rapidly and that are also unlikely to be found.
|
||
|
|
||
|
Motifs combined by the OR operator are alternatives: if any
|
||
|
one of a set of motifs combined by the OR operator is found, then a
|
||
|
match is declared. All alternatives will be reported. For example if
|
||
|
we had a pattern defined by A AND (B OR C), then all places where A
|
||
|
occurs and B is found within range, and all places where A is found
|
||
|
and C is found within range will be reported. A typical use would be
|
||
|
where we might allow a motif to appear on either strand of the DNA
|
||
|
sequence. For example a weight matrix representing the heatshock
|
||
|
element could be used in a pattern which included heatshock as a
|
||
|
motif class 4 combined using the OR operator with heatshock as a
|
||
|
motif class 5.
|
||
|
|
||
|
The probability calculations are performed for each motif as
|
||
|
it is defined. If an overall probability cut-off is given the
|
||
|
calculation is repeated for each match found. To achieve maximum
|
||
|
searching speed do not give an overall probability cut-off. Overall
|
||
|
cut-off scores should only be used if the motif classes used are
|
||
|
compatible.
|
||
|
|
||
|
There are currently several ways to display the matches: 1 =
|
||
|
each motif and its position is listed; 2 = all the sequence between
|
||
|
the two outermost motifs is listed; 3 = graphical, with a spike
|
||
|
marking the position of the leftmost motif. The library versions
|
||
|
also give entry names, and a one line title; in addition they can be
|
||
|
used to produce aligned families of sequences. When this mode of
|
||
|
output is selected the program will write a separate file for each
|
||
|
match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the
|
||
|
name of the entry in the library. The matching sequence will be
|
||
|
written out so that the spacing between motifs is constant, and set
|
||
|
to the maximum allowed by the pattern definition. Any gaps will be
|
||
|
filled with dashes (-). If the individual sequences were
|
||
|
subsequently written one above the other they should line up so that
|
||
|
all motifs are in register. There are two types of output of this sort:
|
||
|
one, option 4, writes out whole sequences, the other, option 5,
|
||
|
writes out only the sequences between the two outermost motifs.
|
||
|
Note that for option 4 users are
|
||
|
asked to type the position of the first motif, and the reason for
|
||
|
this is explained below. Consider a pattern found in several
|
||
|
sequences. Consider only the first motif in the pattern and suppose
|
||
|
that it was found in different positions in these sequences. Say
|
||
|
that of these positions the one furthest from the left end was
|
||
|
position 100. Then, in order to ensure that all the sequences would
|
||
|
align, we must specify that motif 1 must start at position 100. Any
|
||
|
sequences in which motif 1 started nearer to the left end than
|
||
|
position 100 would be padded accordingly. These modes of output
|
||
|
should only be used when the position of each motif is defined
|
||
|
relative to its immediate neighbour.
|
||
|
|
||
|
The pattern descriptions can be saved to files. These files
|
||
|
can be used instead of typing definitions again at the keyboard. As
|
||
|
the files are annotated, they can easily be changed using system
|
||
|
editors, and the modified versions used to define the variant
|
||
|
patterns for the programs.
|
||
|
|
||
|
Use of lists of entry names
|
||
|
|
||
|
The two programs that operate on libraries have the ability to
|
||
|
restrict their searches to subsets of the libraries. This does not
|
||
|
require sublibraries to be created but instead is achieved by using
|
||
|
files containing a list of the entry names of sequences. The user
|
||
|
may choose to search only those entries on the list or,
|
||
|
alternatively to search all but those on the list (i.e. in the
|
||
|
latter case the list contains the names of those to be excluded).
|
||
|
The programs can search libraries that have indexes and those that
|
||
|
do not. If a list of names for inclusion is used, then the search
|
||
|
will be faster if the index is present. In all other circumstances
|
||
|
the whole library will be read. The list must be in library order
|
||
|
except when it is used to include entries, and an index is
|
||
|
available. The list must contain each entry name on a separate
|
||
|
line, with the name starting in column 1 of the line, i.e. there must
|
||
|
be no spaces at the start of the line. The list of entry names can
|
||
|
be produced by the keyword searches of nip, pip, etc as long as the
|
||
|
listings produced have a space character separating the entry name
|
||
|
from the entry description. This will depend on how well the library
|
||
|
reformatting programs work. For example swissprot entry names tend
|
||
|
to run into the beginning of the descriptions, but other libraries
|
||
|
are generally OK.
|
||
|
|
||
|
One use of the programs is to look for patterns that we
|
||
|
already know about, but in new sequences. However it is hoped that
|
||
|
they will also be useful for finding new motifs. For example several
|
||
|
known control regions in nucleic acid sequences consist of
|
||
|
particular direct or inverted repeats; the inclusion of direct and
|
||
|
inverted repeats as motif classes makes it possible to find
|
||
|
previously unknown motifs of these types. Using these new programs
|
||
|
we can ask questions like: "are there any inverted or direct repeats
|
||
|
near to sections of sequence that contain both a CCAAT box and a
|
||
|
TATA box?"; and to search for such things throughout the libraries.
|
||
|
In addition, the mode of output in which all the sequence between
|
||
|
the two outermost motifs found is printed out allows us to extract
|
||
|
sequences and examine them in more detail for further common
|
||
|
subsequences. For example we might want to collect together all the
|
||
|
sequences between putative CCAAT and TATA boxes.
|
||
|
|
||
|
A further use of the inverted repeat motif class is the
|
||
|
following. If a regulatory sequence in DNA is poorly defined but
|
||
|
is also an inverted repeat, then it might be an advantage to specify it
|
||
|
both as a consensus sequence and a superimposed inverted repeat. In
|
||
|
this way two weak definitions can be combined to produce a stronger
|
||
|
pattern.
|
||
|
|
||
|
Given only a few examples of a motif it should be possible to
|
||
|
perform initial searches using a class 3 motif, and then, using
|
||
|
plausible matching sequences, create a more specific weight matrix
|
||
|
for the same motif.
|
||
|
|
||
|
If motifs are combined with the first motif using the OR
|
||
|
operator they will be ignored until all permutations that include
|
||
|
the first motif have been looked for. The whole search will then be
|
||
|
repeated, in turn, for each of those motifs that are combined with
|
||
|
the first motif using the OR operator. An interesting consequence
|
||
|
of this is that the program can be used, without change, to compare
|
||
|
any newly determined sequence with all known individual motifs. We
|
||
|
achieve this by having a pattern in which all known relevant motifs
|
||
|
are combined using the OR operator. If we ask to use this pattern
|
||
|
with a sequence, the program will automatically compare each
|
||
|
individual motif in the pattern with the whole length of the
|
||
|
sequence. As the number of known motifs grows this should become an
|
||
|
increasingly useful standard procedure.
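
The idea of comparing a sequence with every motif in an OR-combined
pattern can be sketched in Python as follows. This is a simplified
illustration only: motifs are treated as exact strings, whereas the
programs use scored motif classes, and the names find_motif and
scan_all and the toy motif strings are hypothetical.

```python
def find_motif(sequence, motif):
    """Return the 1-based start positions of every exact occurrence of motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)  # step by 1 to allow overlaps
    return positions


def scan_all(sequence, motifs):
    """Treat the motifs as alternatives (OR): search each over the whole sequence."""
    return {name: find_motif(sequence, motif) for name, motif in motifs.items()}


# Toy set of "known" motifs; simplified stand-ins, not real definitions.
known = {"CCAAT box": "CCAAT", "TATA box": "TATAAA"}
hits = scan_all("GGCCAATCTTATAAAGGCCAATC", known)
# hits == {"CCAAT box": [3, 18], "TATA box": [10]}
```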
|
||
|
|
||
|
The NOT operator is obviously useful for making sure
|
||
|
particular motifs are not present, but it can also be used to
|
||
|
bracket the levels of matches found. We may want a degree of match
|
||
|
that lies between two limits - binding should occur, but not too
|
||
|
strongly; or base-pairs should form, but not too many. We can
|
||
|
specify this by asking for a match with a low score, in combination
|
||
|
with a match with a high score, both for the same motif, but with the
|
||
|
high score included using the NOT operator.
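
A minimal Python sketch of this bracketing follows. It assumes a very
simple score (the number of positions agreeing with a consensus
string), which is not necessarily how the programs score a motif; the
names match_score and bracketed_matches are hypothetical.

```python
def match_score(window, consensus):
    """Count the positions at which window and consensus agree."""
    return sum(a == b for a, b in zip(window, consensus))


def bracketed_matches(sequence, consensus, low, high):
    """Return (position, score) pairs with score >= low AND NOT score >= high."""
    width = len(consensus)
    found = []
    for i in range(len(sequence) - width + 1):
        score = match_score(sequence[i:i + width], consensus)
        # Same motif twice: once at the low cutoff, once negated at the high one.
        if score >= low and not score >= high:
            found.append((i + 1, score))  # positions are 1-based
    return found


print(bracketed_matches("TATAAATATCAA", "TATAAA", 4, 6))
# prints [(3, 4), (7, 5)]; the perfect match at position 1 is excluded
```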
|
||
|
|
||
|
The algorithm is designed to find all sections of a sequence
|
||
|
that satisfy the pattern rather than only the best match.
|
||
|
Particularly if some of the motifs in a pattern are less well
|
||
|
defined than others, this can often result in the same region of a
|
||
|
sequence being reported as having several matches that differ only
|
||
|
in the positions of the weakest motifs.
|
||
|
|
||
|
General remarks on motif searching
|
||
|
|
||
|
Generally motifs are short subsequences that are thought to be
|
||
|
associated with particular functions in some known sequences. Often
|
||
|
we search for them to try to understand or interpret other
|
||
|
sequences. Sometimes we search for motifs and patterns to test a
|
||
|
hypothesis about their role: are they found in the expected
|
||
|
positions in the expected sequences? In doing so we should remember
|
||
|
that, in both proteins and nucleic acids, what we are really looking
|
||
|
for is a particular three dimensional structure with certain
|
||
|
affinities for other structures, and that we are assuming that the
|
||
|
sequence of the motif alone defines the 3D structure we are searching
|
||
|
for. The overall structure may be completely different to those in
|
||
|
which the motif is functional, and hence the motif may have a
|
||
|
different shape or be inaccessible. We should be aware of the
|
||
|
importance of the context in which a motif is found. Where does it
|
||
|
lie relative to the overall structure, is it accessible, is the
|
||
|
three dimensional spacing between it and other motifs correct? For
|
||
|
example, is it on the same side of the double helix, and the correct
|
||
|
distance from some other motif? How does context affect our
|
||
|
assessment of the significance of finding a motif? Finding false
|
||
|
mammalian mRNA splice junctions in non-coding sequences is far less
|
||
|
important than finding false sites in pre-mRNA sequences, but
|
||
|
finding them in the correct places is most important! In other
|
||
|
words, it is often the case that when we are searching for a motif
|
||
|
that is known to be necessary for some function, then a positive
|
||
|
result in the form of a match in the required position is more
|
||
|
important than a high background of matches in the wrong positions.
|
||
|
Being able to write down the probability of finding a motif in a
|
||
|
random sequence tells us how well it is defined. In nucleic acids
|
||
|
the DNA may contain many superimposed types of information such as
|
||
|
those concerned with histone phasing, protein coding or mRNA
|
||
|
secondary structure. These overlapping "codes" may interfere with
|
||
|
one another causing matches to motifs to be poorer than expected.
|
||
|
In general we will only have a limited number of examples of the
|
||
|
motif and we do not know how representative they are.
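
The remark above about the probability of finding a motif in a random
sequence can be made concrete for the simplest case. The Python sketch
below assumes a consensus written in IUPAC ambiguity codes and a random
sequence with independent bases; both the model and the names
match_probability and expected_hits are illustrative assumptions, not
the calculation used by the programs.

```python
# Bases allowed by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(motif, freq=None):
    """Probability that one random window matches motif (independent bases)."""
    if freq is None:
        freq = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    probability = 1.0
    for code in motif:
        probability *= sum(freq[base] for base in IUPAC[code])
    return probability


def expected_hits(motif, sequence_length, freq=None):
    """Expected number of matching windows in a random sequence."""
    windows = max(sequence_length - len(motif) + 1, 0)
    return windows * match_probability(motif, freq)
```

The smaller the probability, the more tightly the motif is defined; a
motif expected to occur many times by chance in a sequence of the
length being searched gives little information on its own.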
|
||
|
|
||
|
Sequences have superimposed functions: some parts may be of
|
||
|
general structural importance and give rise to an overall framework,
|
||
|
and other parts give specificity and hence are not common; we may
|
||
|
want to use a set of aligned sequences to define a motif, but want
|
||
|
to use only the framework positions. Alternatively we may want to
|
||
|
pick out only those parts of a set of aligned sequences that give a
|
||
|
particular property, and to ignore other similarities that are due
|
||
|
to some other property and which could obscure the pattern we are
|
||
|
interested in. It is possible to apply a mask to a set of aligned
|
||
|
sequences in order to give weight to selected positions only. The
|
||
|
ability to define a mask allows certain positions to be used in the
|
||
|
motif and others to be ignored, and yet still permits the use of a
|
||
|
set of aligned sequences to calculate weights. The mask is requested
|
||
|
and applied by the program and results in the masked positions being
|
||
|
zero in the weight matrix. The mask is defined in the following way.
|
||
|
Suppose we had a motif of length 15; then the mask x--x--xx-x will
|
||
|
give zero weights to positions 2,3,5,6 and 9 (note it is the dashes
|
||
|
(-) that are significant and that positions 1,4,7,8,10,11,12,13,14
|
||
|
and 15 will be non-zero). Of course the same set of sequences could
|
||
|
be used with several alternative masks in order to extract different
|
||
|
features and create corresponding weight matrices.
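
The effect of the mask can be sketched in Python. The example uses raw
base counts as weights, which is an assumption made for illustration
(the actual weight calculation is not described here), and the names
masked_positions and weight_matrix are hypothetical.

```python
def masked_positions(mask):
    """Return the 1-based positions that the mask sets to zero (the dashes)."""
    return [i for i, ch in enumerate(mask, start=1) if ch == "-"]


def weight_matrix(aligned, mask, alphabet="ACGT"):
    """Build a count matrix from aligned sequences, zeroing masked columns."""
    length = len(aligned[0])
    zeroed = set(masked_positions(mask))
    matrix = []
    for col in range(length):
        if col + 1 in zeroed:
            matrix.append({base: 0 for base in alphabet})
        else:
            column = [seq[col] for seq in aligned]
            matrix.append({base: column.count(base) for base in alphabet})
    return matrix


print(masked_positions("x--x--xx-x"))
# prints [2, 3, 5, 6, 9]
```

Positions beyond the end of the mask (11 to 15 for a motif of length
15) are not in the zeroed set, so they keep their weights, as in the
example above.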
|
||
|
|
||
|
The programs are described in Staden,R. CABIOS 4, 53-60, 1988;
|
||
|
Staden,R. CABIOS 5, 89-96, 1989, and Methods in Enzymology 183,
|
||
|
193-211 (1990).
|
||
|
@ end of help
|
||
|
|
||
|
|
||
|
|