gde_linux/CORE/xylem/prot2nuc.doc

124 lines
5.7 KiB
Plaintext

prot2nuc update 10 Aug 94
NAME
prot2nuc - reverse translates protein into nucleic acid
SYNOPSIS
prot2nuc [-ln -gn] < input > output
DESCRIPTION
prot2nuc reads a file containing an amino acid sequence
and writes the corresponding reverse translated nucleic acid
sequence, using the standard IUPAC-IUB ambiguity codes to output.
The amino acid sequence may contain internal stop '*' characters.
That is, all legal amino acid characters will be processed.
-ln print n amino acids/codons per line. (default = 25)
-gn number the amino acid sequence every n amino acids/codons.
(defalut = 5)
If l is not evenly divisible by g, the defaults are used.
input - If the first line of the file begins with '>' or ';',
input will be read as the standard .wrp (Pearson) format,
such as that produced by getob:
>name
sequence lines
Otherwise, it will be assumed that the file ONLY contains
sequence, and all legal IUPAC/IUB DNA characters will be
read as sequence.
output - The output begins with a header, listing the both
1 and 3 letter amino acid codes [J. Biol. Chem. 243, 3557-3559
(1968)], as well as the nucleic acid ambiguity codes [Cornish-
Bowden (1985) Nucl. Acids Res. 13:3021-3030.]. The amino acid
sequence, along with its reverse translation, are then printed on
lines of l amino acids/codons, numbering every g amino acids/codons.
Non-ambiguous nucleotides appear capitalized, while ambiguous
nucleotides are in lowercase. A sample output file appears below:
PROT2NUC Version 8/10/94
IUPAC-IUP AMINO ACID SYMBOLS
[J. Biol. Chem. 243, 3557-3559 (1968)]
Phe F Leu L Ile I
Met M Val V Ser S
Pro P Thr T Ala A
Tyr Y His H Gln Q
Asn N Lys K Asp D
Glu E Cys C Trp W
Arg R Gly G STOP *
Asx B Glx Z UNKNOWN X
IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE
[Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.]
Symbol Meaning | Symbol Meaning
------------------------------------+---------------------------------
G Guanine | k G or T
A Adenine | s G or C
C Cytosine | w A or T
T Thymine | h A or C or T
U Uracil | b G or T or C
r Purine (A or G) | v G or C or A
y Pyrimidine (C or T) | d G or T or A
m A or C | n G or A or T or C
pI39
5 10 15 20
M E K K S L A A L S F L L L L V L F V A
ATGGArAArAArTCnCTnGCnGCnCTnTCnTTyCTnCTnCTnCTnGTnCTnTTyGTnGCn
AGyTTr TTrAGy TTrTTrTTrTTr TTr
25 30 35 40
Q E I V V T E A N T C E H L A D T Y R G
CArGArAThGTnGTnACnGArGCnAAyACnTGyGArCAyCTnGCnGAyACnTAyCGnGGn
TTr AGr
45 50 55 60
V C F T N A S C D D H C K N K A H L I S
GTnTGyTTyACnAAyGCnTCnTGyGAyGAyCAyTGyAArAAyAArGCnCAyCTnAThTCn
AGy TTr AGy
65 70
G T C H D W K C F C T Q N C
GGnACnTGyCAyGAyTGGAArTGyTTyTGyACnCArAAyTGy
With the Universal Genetic code, ambiguity symbols make it possible
to represent all possible codons for an amino acid using two output
lines. It is important to realize that the ambiguities on each line
can not be combined. For example, CTn and TTr represent all codons for
Leucine. However, attempting to combine them into a single triplet,
yTn, would be incorrect. For example, TTT and TTC are codons for
Phenylalanine, not Leucine.
FUTURE PLANS
1. It wouldn't be hard to have the output printed as nucleic acid
sequences in Perason format, so that the output could be read back
into GDE. I don't know why you would want to do this, but it could
be done.
2. Right now, only the Universal Genetic Code is used, but it should
be possible to read in alternative genetic codes, have prot2nuc
figure out the ambiguity rules (as is already done in ribosome) and
print out the appropriate ambiguous codons.
3. It might be useful to have each possible codon printed out, rather
than ambiguous codons. This would take up a lot more space and
wouldn't be as pretty. If there's a lot of demand I could do this.
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca