gde_linux/CORE/xylem/prot2nuc.doc

      prot2nuc                                                update 10 Aug 94

      NAME
           prot2nuc -  reverse translates protein into nucleic acid

      SYNOPSIS
           prot2nuc [-ln -gn] < input > output

      DESCRIPTION
           prot2nuc reads a file containing an amino acid sequence
           and writes the corresponding reverse-translated nucleic acid
           sequence to output, using the standard IUPAC-IUB ambiguity codes.
           The amino acid sequence may contain internal stop '*' characters.
           That is, all legal amino acid characters will be processed.

           -ln    print n amino acids/codons per line. (default = 25)

           -gn    number the amino acid sequence every n amino acids/codons.
                  (default = 5)

           If l is not evenly divisible by g, the defaults are used.
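
           The fallback rule can be sketched as follows. This is an
           illustrative Python sketch, not the prot2nuc source. The name
           resolve_layout is hypothetical, and the sketch assumes that
           "the defaults are used" means both l and g revert to 25 and 5.

```python
DEFAULT_L = 25  # amino acids/codons per line (documented -l default)
DEFAULT_G = 5   # numbering interval (documented -g default)

def resolve_layout(args):
    """Return (l, g) from arguments such as ['-l20', '-g10'].

    Assumption: if l is not a multiple of g, BOTH values revert to
    their defaults. Unrecognised arguments are ignored in this sketch.
    """
    l, g = DEFAULT_L, DEFAULT_G
    for arg in args:
        if arg.startswith("-l") and arg[2:].isdigit():
            l = int(arg[2:])
        elif arg.startswith("-g") and arg[2:].isdigit():
            g = int(arg[2:])
    # Reject non-positive values and any l that g does not divide evenly.
    if l <= 0 or g <= 0 or l % g != 0:
        return DEFAULT_L, DEFAULT_G
    return l, g

print(resolve_layout(["-l20", "-g5"]))  # (20, 5)
print(resolve_layout(["-l20", "-g6"]))  # (25, 5)
```

           Under this assumption, -l20 -g5 is accepted as given, while
           -l20 -g6 falls back to 25 and 5.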

           input - If the first line of the file begins with '>' or ';',
           input will be read as the standard .wrp (Pearson) format,
           such as that produced by getob:

           >name
           sequence lines


           Otherwise, it will be assumed that the file ONLY contains
           sequence, and all legal IUPAC-IUB amino acid characters will be
           read as sequence.
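
           The input rule above can be sketched in Python as follows.
           This is not the prot2nuc source. The name read_protein is
           hypothetical, and the sketch assumes that the name is the first
           word after '>', that lines starting with ';' are comments, that
           lowercase letters are accepted, and that any character outside
           the legal set is skipped.

```python
# One-letter symbols from the header table that prot2nuc prints:
# 20 amino acids plus B (Asx), Z (Glx), X (unknown) and '*' (stop).
LEGAL = set("FLIMVSPTAYHQNKDECWRGBZX*")

def read_protein(lines):
    """Return (name, sequence) from an iterable of text lines."""
    lines = [ln.rstrip("\n") for ln in lines]
    name = ""
    body = lines
    # A first line starting with '>' or ';' selects the Pearson-style branch.
    if lines and lines[0][:1] in (">", ";"):
        body = []
        for ln in lines:
            if ln.startswith(">"):
                words = ln[1:].split()
                name = words[0] if words else ""
            elif not ln.startswith(";"):
                body.append(ln)
    # Keep only legal symbols; blanks, digits and punctuation are dropped.
    seq = "".join(ch for ln in body for ch in ln.upper() if ch in LEGAL)
    return name, seq

print(read_protein([">pI39\n", "MEKKS\n", "LAAL*\n"]))  # ('pI39', 'MEKKSLAAL*')
print(read_protein(["MEK KS 10\n"]))                    # ('', 'MEKKS')
```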

           output - The output begins with a header listing both the
           1- and 3-letter amino acid codes [J. Biol. Chem. 243, 3557-3559
           (1968)], as well as the nucleic acid ambiguity codes
           [Cornish-Bowden (1985) Nucl. Acids Res. 13:3021-3030]. The amino
           acid sequence, along with its reverse translation, is then printed
           on lines of l amino acids/codons, numbered every g amino
           acids/codons.
           Non-ambiguous nucleotides appear capitalized, while ambiguous
           nucleotides are in lowercase. A sample output file appears below:

     PROT2NUC       Version  8/10/94

     IUPAC-IUP AMINO ACID SYMBOLS
     [J. Biol. Chem. 243, 3557-3559 (1968)]

          Phe         F          Leu         L          Ile         I
          Met         M          Val         V          Ser         S
          Pro         P          Thr         T          Ala         A
          Tyr         Y          His         H          Gln         Q
          Asn         N          Lys         K          Asp         D
          Glu         E          Cys         C          Trp         W
          Arg         R          Gly         G          STOP        *
          Asx         B          Glx         Z          UNKNOWN     X


     IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE
     [Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.]

     Symbol         Meaning              | Symbol         Meaning
     ------------------------------------+---------------------------------
     G              Guanine              | k              G or T
     A              Adenine              | s              G or C
     C              Cytosine             | w              A or T
     T              Thymine              | h              A or C or T
     U              Uracil               | b              G or T or C
     r              Purine (A or G)      | v              G or C or A
     y              Pyrimidine (C or T)  | d              G or T or A
     m              A or C               | n              G or A or T or C

     pI39
                   5             10             15             20
     M  E  K  K  S  L  A  A  L  S  F  L  L  L  L  V  L  F  V  A
     ATGGArAArAArTCnCTnGCnGCnCTnTCnTTyCTnCTnCTnCTnGTnCTnTTyGTnGCn
                 AGyTTr      TTrAGy   TTrTTrTTrTTr   TTr

                  25             30             35             40
     Q  E  I  V  V  T  E  A  N  T  C  E  H  L  A  D  T  Y  R  G
     CArGArAThGTnGTnACnGArGCnAAyACnTGyGArCAyCTnGCnGAyACnTAyCGnGGn
                                            TTr            AGr

                  45             50             55             60
     V  C  F  T  N  A  S  C  D  D  H  C  K  N  K  A  H  L  I  S
     GTnTGyTTyACnAAyGCnTCnTGyGAyGAyCAyTGyAArAAyAArGCnCAyCTnAThTCn
                       AGy                              TTr   AGy

                  65             70
     G  T  C  H  D  W  K  C  F  C  T  Q  N  C
     GGnACnTGyCAyGAyTGGAArTGyTTyTGyACnCArAAyTGy


     With the Universal Genetic Code, ambiguity symbols make it possible
     to represent all possible codons for an amino acid using two output
     lines. It is important to realize that the ambiguities on each line
     cannot be combined. For example, CTn and TTr together represent all
     codons for Leucine. However, attempting to combine them into a single
     triplet, yTn, would be incorrect, because yTn also covers TTT and TTC,
     which are codons for Phenylalanine, not Leucine.
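
     The two-line scheme and the Leucine example can be checked with a short
     Python sketch. This is not the prot2nuc source. The names IUPAC, REVERSE,
     expand and reverse_translate are hypothetical. The table covers only the
     20 standard amino acids, since the sample output does not show how B, Z,
     X and '*' are written.

```python
from itertools import product

# IUPAC-IUB nucleotide symbols, as listed in the prot2nuc header.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "r": "AG", "y": "CT", "m": "AC", "k": "GT", "s": "CG", "w": "AT",
    "h": "ACT", "b": "CGT", "v": "ACG", "d": "AGT", "n": "ACGT",
}

# Ambiguous codons under the Universal Genetic Code.
# Each value is (first output line, second output line or "").
REVERSE = {
    "F": ("TTy", ""), "L": ("CTn", "TTr"), "I": ("ATh", ""),
    "M": ("ATG", ""), "V": ("GTn", ""), "S": ("TCn", "AGy"),
    "P": ("CCn", ""), "T": ("ACn", ""), "A": ("GCn", ""),
    "Y": ("TAy", ""), "H": ("CAy", ""), "Q": ("CAr", ""),
    "N": ("AAy", ""), "K": ("AAr", ""), "D": ("GAy", ""),
    "E": ("GAr", ""), "C": ("TGy", ""), "W": ("TGG", ""),
    "R": ("CGn", "AGr"), "G": ("GGn", ""),
}

def expand(codon):
    """Return the set of concrete codons an ambiguous triplet stands for."""
    return {"".join(p) for p in product(*(IUPAC[ch] for ch in codon))}

def reverse_translate(protein):
    """Return the two codon lines for a protein string."""
    first = "".join(REVERSE[aa][0] for aa in protein)
    # Pad missing second-line codons so columns stay aligned.
    second = "".join(REVERSE[aa][1].ljust(3) for aa in protein)
    return first, second.rstrip()

leu = expand("CTn") | expand("TTr")
print(sorted(leu))                  # ['CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG']
print(sorted(expand("yTn") - leu))  # ['TTC', 'TTT']
```

     The second print lists the codons that yTn would add beyond the six
     Leucine codons: the two Phenylalanine codons.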

     FUTURE PLANS
     1. It wouldn't be hard to have the output printed as nucleic acid
     sequences in Pearson format, so that the output could be read back
     into GDE. I don't know why you would want to do this, but it could
     be done.
     2. Right now, only the Universal Genetic Code is used, but it should
     be possible to read in alternative genetic codes, have prot2nuc
     figure out the ambiguity rules (as is already done in ribosome) and
     print out the appropriate ambiguous codons.
     3. It might be useful to have each possible codon printed out, rather
     than ambiguous codons. This would take up a lot more space and
     wouldn't be as pretty. If there's a lot of demand I could do this.
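
     As a rough illustration of item 3, the following Python sketch lists
     every codon for each amino acid instead of ambiguous codons. It is not
     part of prot2nuc. The names BASES, AMINO, codon_table and all_codons are
     hypothetical. The 64-character string is the standard genetic code with
     codons ordered TTT, TTC, TTA, TTG, TCT and so on.

```python
from itertools import product

BASES = "TCAG"
# Standard (universal) genetic code, one symbol per codon in TCAG order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

def codon_table(amino=AMINO):
    """Map each amino acid symbol to the list of codons that encode it."""
    table = {}
    for codon, aa in zip(product(BASES, repeat=3), amino):
        table.setdefault(aa, []).append("".join(codon))
    return table

def all_codons(protein, table=None):
    """Return [(amino acid, [codons])] for each residue of the protein."""
    table = table or codon_table()
    return [(aa, table[aa]) for aa in protein]

for aa, codons in all_codons("ML*"):
    print(aa, " ".join(codons))
# M ATG
# L TTA TTG CTT CTC CTA CTG
# * TAA TAG TGA
```

     Passing a different 64-character string to codon_table would be one
     way to select an alternative genetic code, as suggested in item 2.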

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca