gde_linux/CORE/xylem/getob.doc


      GETOB                                                 21 Dec 94


      NAME
           getob - Get an object from GenBank

      SYNOPSIS
           getob [-frcn] infile namefile anofile seqfile indfile message
                 [outfile] expfile

      DESCRIPTION
           getob extracts 'objects' (subsequences)  from GenBank entries, using
           the features table, and writes them to outfile (.out).  A log
           describing the construction of each object is written to message
           (.msg). If -r is not set, a list of expressions that would recreate
           the .out file if evaluated by getob -r, is written to expfile (.exp)

           The following options are available:

	   f    Write each entry to a separate file. The name will consist
                of the entry name, and the extension '.obj'.

	   r    Resolve expressions from namefile into objects.
                Expressions take the form:

                @[<database>::]<accession>:<location>

                In effect, r makes it possible to use getob to resolve
                features that span more than one entry, such as segmented
                files.  In the first run of the program, features that require
                data from outside the entry in which they are defined will be
                written to outfile with those externally-defined parts rep-
                resented using the '@' notation described above.  During a
                subsequent run, the outfile from the previous run is used as
                namefile.  When r is set, all lines not beginning with '@' (ie.
                name lines and sequence lines) are simply copied to the new
                outfile.  When an '@' is encountered, the expression is parsed
                into accession number and location.  The entry with the
                specified accession number is located in indfile, and read from
                anofile and seqfile.  It is then evaluated, and the result
                written to outfile in place of the '@' expression.

                getob can also be used to get specific labeled objects from
                a given entry.  Examples:

                @k30576:polyprotein
                @k30576:/label=polyprotein
                @x10345:/product="hsp70"
                @j00879:group(1..2200,mutation_37)

                The first two constructs given above are equivalent.  Both
                will extract the feature called polyprotein.  The third
                construct shows that any feature label can be specified. If
                none is specified, as in the first example, then /label= is
                assumed.  One limitation, however, is that the label sought
                must be unique within the entry in its first 15 characters
                including double quotes (").  Otherwise, only the first
                matching label expression will be evaluated.  Finally, the
                last example shows that a mutant sequence can be constructed
                by first specifying an expression that evaluates to a
                sequence (ie. 1..2200) and then a labeled expression that
                upon evaluation, uses replace() to modify that sequence. The
                usage shown in examples 3 & 4 above represent extensions to
                the DDBJ/EMBL/GenBank Features Table Format.

                As touched on briefly above, the r option makes it possible
                to construct objects that include recursive references to
                other entries (eg. segmented files) by iterative calls to
                getob. The 'features' command automates this process. The basic
                algorithm is as follows:

                getob infile namefile anofile seqfile indfile ...

                #Pull out all lines containing indirect references
                grep '@' outfile > unresolved.grep

                while (unresolved.grep is not empty)

                     #extract accession numbers to be retrieved
                     cut -c2-7 unresolved.grep > unresolved.nam

                     #retrieve the sequences into a new file, and create
                     #a database subset to be used by getob
                     fetch unresolved.nam new.gen
                     splitdb new.gen new.ano new.wrp new.ind

                     #run getob again to resolve indirect references
                     getob -r infile outfile new.ano new.wrp new.ind ...

                     #Pull out all lines containing indirect references
                     grep '@' outfile > unresolved.grep
                   end

            c   NAMEFILE contains accession numbers, rather than locus names

            n   By default, the qualifier 'codon_start' is used to determine
                how many n's, if necessary, must be added to the 5' end of
                CDS, mat_peptide, or sig_peptide, to preserve the reading
                frame. To turn OFF this feature, -n must be set. -n must be set
                for GenBank Releases 67.0 and earlier.

            infile contains commands indicating what data is to be pulled from
            each entry. Two types of output may be presented, GenBank or
            OBJECTS.  These are described below:

            1) GenBank output - If the word 'GENBANK' is the first line in
            infile, a pseudo-GenBank entry will be recreated.  This option
            is only intended for debugging purposes and will probably be
            removed in later releases.

            2) Object format - This option instructs getob to write part or
            all of each sequence, along with site annotation, by specifying
            feature key names.  The syntax for infile is shown below:

            Backus-Naur format:                  Example:
            ----------------------------------------------------------
            OBJECTS                              OBJECTS
            <feature key>                        tRNA
            {<feature key>                       rRNA
               . . .                             SITES
            <feature key>}                       stem_loop
            SITES
            {<feature key>
               . . .
            <feature key>}

            In the example above, getob is instructed to extract all tRNA or
            rRNA sequences from each entry, and annotate the position of each
            stem/loop structure. Note that the SITES coordinates written to the
            file tell the positions of those SITES relative to the start of the
            object, rather than the original location in the sequence. As above,
            each word begins a separate line.

            While the -r option does not use infile, at least a dummy infile
            must be included in the command line. This dummy file need only
            contain two lines:

            OBJECTS
            SITES

            NOTE: SITES IS NOT YET IMPLEMENTED! Although inclusion of SITES in
            the input file will have no effect, the word SITES must still be
            present after the last feature key.


            namefile
            namefile consists of a list of LOCUS names or accession numbers,
            each on a separate line. Names or accession numbers should appear
            in the order in which they appear in the database file.  Unordered
            namefiles will slow the progress of the search. Since only the
            first non-blank field of each line in namefile is read, indfile
            could be used to create a namefile.  If the entire indfile was
            used, the entire database file would be processed. A sample
            namefile requesting four sequences by LOCUS name is shown below:

            POTPR1A
            POTPSTH2
            POTPSTH21
            POTSTHA

            anofile, seqfile, and indfile
            The database subset containing GenBank entries must be divided
            among annotation, sequence and an index by splitdb.

            message
            message contains a log describing the parsing of each object.
            For annotative purposes, qualifier lines from the object are
            included in along with the location expression being parsed.
            The beginning of a typical message file is shown below:

            GETOB          Version 0.962     14 May 1992

            POTPR1A:CDS1
                join
                    (
                         295                 603

                         1011                 1355

                    )


            /note="pathogenesis-related protein (prp1)"
            /codon_start=1
            /translation="MAEVKLLGLRYSPFSHRVEWALKIKGVKYEFIEEDLQNKSPLLL
            QSNPIHKKIPVLIHNGKCICESMVILEYIDEAFEGPSILPKDPYDRALARFWAKYVED
            KGAAVWKSFFSKGEEQEKAKEEAYEMLKILDNEFKDKKCFVGDKFGFADIVANGAALY
            LGILEEVSGIVLATSEKFPNFCAWRDEYCTQNEEYFPSRDELLIRYRAYIQPVDASK"
            //----------------------------------------------

            In the example above, getob was instructed to retrieve all CDS
            features from the database subset. The message for the entry
            POTPR1A is shown, along with a reconstruction of the location
            expression that was evaluated to create the object. In this
            case, protien coding sequences from two exons had to be joined
            to create the object.

            outfile
            outfile contains the actual objects constructed, consisting of
            sites found and sequences. The beginning of a typical output file
            is shown below:

           >POTPR1A:CDS1
           atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
           agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
           aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
           aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
           ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
           ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
           gaagataag
           ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
           agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
           aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
           gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
           tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
           aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
           atccgttaccgagcctacattcagcctgttgatgcttcaaaatga

           In the example, the CDS from entry POTPR1A  has been written in
           two chunks, corresponding to the two exon portions of the coding
           sequence. Each location retrieved in constructing the object is
           written as a separate block of sequence. By comparing message file
           to outfile, it is possible to verify the correctness of the
           operation.

           Numbers are appended to the sequence names to indicate
           which CDS in the entry has been retrieved. Thus, if two CDS
           features were present, the second one would be named >POTPR1A:2.
           For compatiblility with the FASTA programs of Pearson, the name line
           begins with a '>'.

           expfile
           The expression evaluated to create this feature is written
           to expfile:

           >POTPR1A:CDS1
           @J03679:join(295..603,1011..1355)

           expfile is only created if -r is not set. It is itended as a way
           of automating the creation of a feature expression file for use
           in generating customized datasets. Expressions in expfile can be
           deleted or modified, or new expressions added, to tailor the
           dataset to individual needs. To generate a dataset from expfile:

           getob -r infile expfile anofile seqfile indfile message outfile

      EXTENSIONS TO THE FEATURE TABLE LANGUAGE

          1) poly(<absolute_location>|<literal>|<feature_name>,x)

             This operator evaluates an absolute location, literal, or
             feature name (ie. any location not containing functional
             operators) and writes it x times. The most obvious
             application of poly is to create spacers to represent regions
             of unknown sequence between sequences that are known. For
             example, the restriction map of a 4kb EcoR1 fragment with a
             Hind3 site 1000 bp from one end could be represented as follows:

             join("gaattc",poly("n",1000),"aagctt",poly("n",3000),"gaattc")

          2) The following feature keys are recognized by GETOB, although
             not included in the language definition. While they will not
             appear in GenBank entries, they could be used in user-created
             GenBank-format files:

             contig
             This feature key is meant to be used to assemble large
             sequence segments from smaller segments, possibly using the
             poly() operator.

             chromosome
             Intended to annotate the complete sequence of a chromosome. This
             feature may be constructed by a join of two or more contigs.

             Use of these keywords is illustrated in the features table
             shown below, which could be used to construct a model of part
             of the E.coli chromosome, spanning map units 763.4 to 1031.4 kb:

             contig          join(J01619:1..13063,poly("n",7140),
                             J03939:1..1363,poly("n",14380),
                             X02306:complement(1..1622),poly("n",14710),
                             J04423:1..5793,poly("n",22500),
                             X03722:1..2400,poly("n",123750),
                             one-of(X05017:complement(1..1854),X05017:1..1854))
                             /label=Eco_contig8
                             /map=763.4-950.6kb
             contig          join(V00352:1..2412,poly("n",28800),M15273:1..3409)
                             /label=Eco_contig9
                             /map=972.9-1001.7kb
             contig          join(X02826:1..1357,poly("n",13540),
                             J01654:complement(1..2270))
                             /label=Eco_contig10
                             /map=1016.5-1031.4kb
             chromosome      join(Eco_contig8,poly("n",22300),
                             Eco_contig9,poly("n",14800),
                             Eco_contig10)
                             /label=Ecoli_chromosome

      NOTES
        1)  If the const DEBUG is set to true in the Pascal source code, getob
            writes messages to the standard output, indicating the progress of
            processing for each entry read in. By default, DEBUG=false.
            This feature is solely for debugging purposes and will be removed in
            later releases.

        2)  GETOB automatically expands leading blanks that have been
            compressed using splitdb -c. See splitdb.doc for more information.

      SEE ALSO
            features, splitdb, getloc
            The DDBJ/EMBL/GenBank Feature Table: Definition, Version 1.04
              September 1, 1992
            GenBank Release Notes for Release 79.0.

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.