gde_linux/CORE/xylem/features.doc


      FEATURES.DOC                                          update  7 Feb 94


      NAME
            FEATURES - extracts features from GenBank entries

      SYNOPSIS
            features
            features expression
            features [-f featurekey | -F keyfile]
                     [-n name     |-a accession    | -e expression |
                      -N namefile |-A accfile      | -E expfile]
                     [-u dbfile   | -U dbfile      | -g ]
            features -h

      DESCRIPTION
            FEATURES extracts sequence objects from GenBank entries, using
            the Features Table language. Features can be retrieved either by
            specifying keywords (eg. CDS, mRNA, exon, intron etc.) or by
            evaluating expressions. In practical terms, FEATURES is actually
            a user interface for GETOB, which actually performs the parsing
            and extraction of sequence objects. FEATURES can be run either as
            an interactive program or with command line arguments.

            'features' with no arguments runs the program interactively.
            'features' followed by an expression retrieves the data directly
            from GenBank and evaluates the expression. The third form of
            features requires all arguments to be accompanied by their
            respective option flags. Finally, 'features -h' prints the
            SYNOPSIS.


      INTERACTIVE EXECUTION
      FEATURES executed with no arguments runs interactively. An example of the
      FEATURES menu is shown below:

      ___________________________________________________________________
                           FEATURES - Version    7 FEB 94
          Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
      ___________________________________________________________________
      Features:  tRNA
      Entries:   EPFCPCG
      Dataset:
      ___________________________________________________________________
         Parameter              Description                      Value
      -------------------------------------------------------------------
      1).................... FEATURES TO EXTRACT ....................> f
        f:Type a feature at the keyboard
        F:Read a list of features from a file
      2)....................ENTRIES TO BE PROCESSED (choose one).....> n
        Keyboard input - n:name     a:accession #     e:expression
        File input     - N:name(s)  A:accession #(s)  E:expression(s)
      3)....................WHERE TO GET IT .........................> g
        u:Genbank dataset   g:complete GenBank database
        U: same as u, but all entries
      4)....................WHERE TO SEND IT ........................> a
        s:Each feature to a separate file  a:All output to same file
        ---------------------------------------------------------------
         Type number of your choice or 0 to continue:
      0
      Messages will be written to EPFCPCG.msg
      Final sequence output will be written to  EPFCPCG.out
      Expressions will be written to EPFCPCG.exp
      Extracting features...

      In the example, FEATURES was instructed to retrieve all tRNAs from
      the GenBank entry EPFCPCG, which contains the Epifagus plastid
      genome. By default, the GenBank database was the source of the
      sequence. Messages indicate the progress of the job. A log describing
      the extraction of each feature is written to EPFCPCG.msg, while the
      extracted features themselves are written to EPFCPCG.out. Feature
      expressions which could be used by FEATURES to reconstruct the .out
      file, are written to EPFCPCG.exp.

      The first step is to retrieve the EPFCPCG entry from GenBank, which is
      accomplished by calling FETCH. Next, FEATURES extracts the specified
      features from the entry.

      An excerpt from EPFCPCG.msg is shown below, describing the extraction
      of the fifth tRNA found in this entry. To create this tRNA,  two exons
      had to be joined. The qualifier line associated with this feature
      indicates that it is an Isoleucine tRNA with a gat anticodon.


      EPFCPCG:anticodon gtg
          complement
              (
                  join
                (
                           70023                         70028

                           1                         69

                      )

              )


      /product="transfer RNA-His"
      /gene="His-tRNA"
      /label=anticodon gtg
      /note="anticodon gtg"
      //----------------------------------------------


	 The actual sequence for this feature, as written to EPFCPCG.out, is
	 written with each exon beginning a new line:

      >EPFCPCG:anticodon gtg
      ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
      gcgggttcaattcccgtcg
      ttcgcc

      Finally, the expression that was evaluated to create this feature is
      written to EPFCPCG.exp:

      >EPFCPCG:anticodon gtg
      @M81884:anticodon gtg

      If EPFCPCG.exp was used as an expression file in option 2 (E) of FEATURES,
      EPFCPCG.out would be recreated.

      OPTIONS
      1) FEATURES - choosing f will cause FEATURES to prompt for
	 a feature to extract. If you wish to extract several types of
	 features simultaneously (ie. F), you must construct a file listing the
	 feature keywords. The following example would retrieve both tRNA and
	 rRNA sequences:

	 OBJECTS
	 tRNA
	 rRNA
	 SITES

      The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
      and each keyword must be on a separate line. For a rigorous
      definition of the input file format, see the GETOB manual pages
      (getob.doc).

       In the menu shown above, f was chosen, and the user entered tRNA at
       the prompt. Thus tRNA is now displayed on the Features: line. If
       features had been specified from a file (suboption F) then the
       filename containing the feature keywords would be displayed instead.
       A complete list of legal feature keywords can be found in the GenBank
       Release notes (gbrel.txt) under the subheading 'Feature Key Names'.

	 2) ENTRIES
         n  User is prompted for the name of an entry from which the
	    feature is to be extracted. The name of the entry will appear
	    on the 'Entries' line of the menu.

	 N  User is prompted for a filename containing one or more
	    entry names. Each name must be on a separate line. The filename
	    will be displayed on the 'Entries' menu line.

         a  User is prompted for an accession number, which will appear
	    on the 'Entries' line of the menu.

         A  User is prompted for a filename for accession numbers. The filename
            will appear on the 'Entries:' line.

         e  User is prompted for a GenBank Features expression of the
	    form  accession:location.'accession' refers to a GenBank
	    accession number, while 'location' is any legal feature location.
	    A brief description of location syntax can be found under the
	    subheading "Feature Location" in the GenBank release notes
	    (gbrel.txt). See "The DDBJ/EMBL/GenBank Feature Table:
	    Definition" Version 1.04 for a complete definition.
         E  User is prompted for a filename containing one or more Feature
	    expressions. EACH EXPRESSION MUST BEGIN A '@'. All lines beginning
            with '@' are processed as expressions, and all other lines are
            copied to the output file unchanged.

         Examples:

	    The tRNA shown above could have been extracted by choosing
	    suboption e and entering either of the following expressions:

            M81884:complement(join(70023..70028,1..69))
            M81884:anticodon gtg

	    In the first example, the feature line from the original entry
	    is used as the location. In the second example, the feature is
	    found by its qualifier line, which also appeared in the
	    original entry. It must be noted that the qualifier line must
	    be unique from others in the same entry in its first 15
	    characters after the = .

	    The flaL protein coding region of B. licheniformis is described
	    in GenBank entry BLIFALA, accession number M60287 in the
	    following feature:

            CDS             305..640
                            /note="flaD (sin) homologue"
                            /gene="flaL"
                            /label=ORF2
                            /codon_start=1

         This feature could be retrieved using any of the following
         expressions:

		 M60287:305..640
		 M60287:ORF2
		 M60287:/label=ORF2
		 M60287:/gene="flaL"
		 M60287:/note="flaD (sin) homologue"

           Note  that the /label= qualifier is special, in that labels are
           specifically intented as unique tags on an feature. For labels,
           only the label itself is need be specified. Thus, /label=ORF2 is
           equivalent to ORF2.  For other qualifiers, the qualifier keyword
           (eg. /note=) must be included.

	 3) DATABASE (WHERE TO GET IT) - By default, all entries processed will
	 be automatically retrieved from GenBank using FETCH. Specifying 'u'
	 (User-defined database subset) makes it possible to extract features
	 from GenBank subsets created by the user. Usually, retrieval of
	 features is much faster with a User-defined subset, so if you
	 frequently work with sets of genes, it is best to retrieve them
	 en-masse using FETCH, and work with them directly. For example, if
	 you had retrieved a set of Beta-globin sequences into a file called
	 'globin.gen', you could directly extract features from these entries
	 by specifying 'globin' or 'globin.gen' as your User-defined database.
	 If the file extension is '.gen', FEATURES will automatically create
	 temporary files called globin.ano, globin.wrp and globin.ind,
	 containing annotation, sequence, and an index, respectively. These
	 files will be read during feature extraction, and then discarded. If
	 you have already created such files using SPLITDB, simply specify
	 any of 'globin', 'globin.ano', etc. ie. anything, as long as it does
	 not have the .gen file extension.

         'U' rather than 'u' causes ALL entries in the user-defined
         database to be subset. This means that it is unnecessary to
         specify entry options (eg -n, -N etc.), as these will be
         ignored, if given.

	 One consequence of these conventions is that the individual GenBank
	 divisions can be processed directly. For example, suppose you were only
	 interested in rodent globins.  You could directly access the rodent
	 division of GenBank by specifying the base name of that file division
	 (eg. /home/psgendb/GenBank/gbrod) as your user-defined database. In
	 this case, the files gbrod.ano, gbrod.wrp and gbrod.ind already
	 exist. Again, this approach is faster, since FEATURES would not have
	 to find and retrieve the sequences, but can read directly from the
	 database files. Finally, if you wanted to process all of the entries
	 in the database division, simply use -U. The user is warned that a
	 GenBank division is a huge amount of data, and processing every entry
	 could take a long time.

	 4) WHERE TO SEND IT - By default (a), the output for all entries goes
	 to a single set of files, whose names are chosen by FEATURES,
	 depending on the setting of option 2, Entries. If a single name (n) or
	 accession number (a) has been chosen, that will be used as
	 the raw filename. For example, if you were processing the entry
	 WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names
	 (N), accession numbers (A) or expressions (E) were read from a file,
	 the raw name of that file would be used eg. cellulase.nam would result
	 in cellulase.msg and cellulase.out.  Finally, if a single expression
	 is processed (e), then the primary accession number in that
	 expression will be used for the filenames. In all cases, FEATURES
	 will tell you the names of the files being written.

	 Choosing suboption s, you can specify that the features created for
	 each entry be sent to separate files. In this case, each file will
	 have the name of that entry, with the extension .obj. However, all
	 messages and expressions  will still go to a single files. While this
         can be a convenient way of creating separate files when you need them,
         this option still has the limitation of writing all features for a
         given entry (if there are more than one) to the same file. Also,
         successive resolution of features (anything requiring 'getob -r')
         will not work with this option. This may be corrected in future
         versions.


      COMMAND LINE EXECUTION

      There are two ways of running FEATURES from the command line. If only one
      argument is supplied, that argument is interpreted as an expression, and
      the result of that expression (ie. a sequence ) is written to the
      standard output. .msg, .out and .exp files are NOT created. For example,
      GenBank entry BACFLALA (M60287) contains the following feature:

      CDS             95..271
                      /label=LORF-
                      /codon_start=1
                      /translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
                      PSASIERSHSINPF"
      Any of

      features M60287:LORF-
      features M60287:95..271
      features M60287:/label=LORF-

      would write the open reading frame to the standard output:

      atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
      actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
      tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
      agaagtcattcaataaatcctttctga

      This form of FEATURES is provided to make it easy to pipe output to
      other programs for further processing. For example

      features M60287:LORF- |ribosome >LORF.protein

      would write the translation of the open reading frame to a file called
      LORF.protein.

      The full functionality of the FEATURES can be accessed using arguments on
      the command line. In particular, when there are multiple entries to be
      processed, or multiple features within entries, it is much faster to
      supply FEATURES with lists of entries, feature keys or expressions.
      Command line options are similar to suboptions in menu items 1-3 above:

      Feature keys:
       -f  key               {feature key}
       -F  filename          {file of feature keys}

       Entries:
       -n name                {GenBank LOCUS name}
       -N filename            {file of GenBank LOCUS names}
       -a accession           {GenBank ACCESSION number}
       -A filename            {file of GenBank ACCESSION numbers}
       -e expression          {Feature Table expression}
       -E filename            {file of Feature Table expressions, each begin-
                               ning with '@'}

        Databases:
        -u filename           {GenBank dataset}
        -U filename           { "      "        "  "    "       "    ,
                              process all entries ie. -nNaAeE options
                              will be ignored}
        -g                    {GenBank}

        Examples:

        features -f tRNA -n EPFCPCG

        retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg, .out,
        and .exp files.

        features -e M60287:LORF-

        would retrieve the same open reading frame as in the earlier example.


        Since most time-consuming operation in FEATURES is sequence retrieval,
        it is often best to retrieve frequently-used sequences as database
        subsets. For example, a set GenBank entries for chlorophyl a/b binding
        protein genes might be stored in a file called CAB.gen.

        features -f CDS -N CAB.nam -u CAB.gen

        would generate the files CAB.msg, CAB.out and CAB.exp containing output
        for all CDS features in the entries listed in the file CAB.nam.

        features -E CAB.exp -u CAB.gen

        would re-create the output file CAB.out.


     BUGS
       FEATURES does no preliminary error checking for syntax of
       GenBank expressions prior to their evaluation. Expressions that can
       not be evaluated will be flagged by GETOB in the .msg file.

       At present, little checking is done to test for the presence or
       correctness of input files. Some errors may cause the program to
       crash.

       For User-defined datasets, filename expansion is not performed.

     FILES
        Temporary files:
          X.term X.ano X.wrp X.ind X.gen {X is raw filename, see 4) }
          UNRESOLVED.fea UNRESOLVED.out
          FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind FEA.msg FEA.out

     SEE ALSO
            grep(1V) fetch getob splitdb

     TRANSPORTATION NOTES
	    It should be fairly easy to get FEATURES to work even on systems
	    in which GenBank has not been formatted for the XYLEM package.
	    This is because FEATURES does not work directly on the database, but
	    rather retrieves all necessary sequences by calling FETCH. Thus,
	    statements like 'fetch FEA.nam FEA.gen' could be replaced with any
	    command that, given a file containing names or accession numbers,
	    returns a file containing GenBank entries. In principle, you
	    could even implement this sort of command to retrieve entries from
	    the email server (retrieve@ncbi.nlm.nih.gov) at NCBI, although
	    such a setup would undoubtedly be quite slow.

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.