gde_linux/CORE/xylem/fetch.doc


      FETCH.DOC                                            update 24 Feb 96


      NAME
            fetch - retrieves database entries by name or accession number

      SYNOPSIS
            fetch                                    {interactive mode}
            fetch [options] namefile [output file]   {batch mode}

      DESCRIPTION
	    fetch retrieves one or more entries from a database.

      Interactive mode: fetch prompts the user to set search parameters,
      using an interactive menu:
      ___________________________________________________________________
                           FETCH - Version   7 Feb 94
         Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
      ___________________________________________________________________
      Namefile:
      Outfile:
      Database:
      -------------------------------------------------------------------
         Parameter              Description                      Value

      1) Name/Acc    Name or Accession sequence to get
      2) Namefile    Get list of sequences from Namefile
      3) WhatToGet   a:annotation s:sequence b:both                b
      4) Database    g:GenBank p:PIR  v:VecBase l:LiMB             g
                     G:GenBank dataset   P:PIR dataset
      5) Outfile     Send all output to a single file (Outfile)
      6) Files       f:Send each entry to a separate file          f
         -------------------------------------------------------------
         Type number of your choice or 0 to continue:

            After all parameters have been set, type 0 to commence the search.
            Messages regarding the progress of the search will be printed.

            (1,2) Which entries to get?
            If you want to get a single entry, option 1 lets you type in the
            name of that entry, without having to create a namefile. To get
            more than one entry, choose option 2, and specify the name of a
            file containing sequence names or accession numbers.

            namefile is a file containing one or more sequence names or
	       accession numbers, each on a separate line. Names and accession
            numbers can even be interspersed, in upper or lowercase, and in
            any order. For example, the namefile prp.nam might contain

            ; plant pathogenesis related proteins
            ;   (these are sample comment lines)
            ;   note that any line containing a semicolon is ignored
            x06362
            x05454
            TOBPR1A1
            ;  comments can be interspersed with names.
            PUMPR13
            tobpr1ar

            Options 1 & 2 are mutually exclusive. Setting one will negate the
            other. If option 2 is chosen, the name of the namefile will appear
            at the top of the menu.

            (3) WhatToGet
            Use this option to specify whether to get annotation, sequence,
            or both (default=both).

            (4) Database
            Use this option to select the database. (default=GenBank).
            G and P select user-created database subsets containing GenBank
            or PIR entries, respectively. It is assumed that the database
            has been split into .ano, .wrp and .ind files using splitdb.
            For example, if you had created a database subset called PR1.pir,
            splitdb would create PR1.ano, PR1.wrp and PR1.ind. These are
            the files actually read by FETCH. When prompted for the name
            of the database, simply type "PR1", without a file extension.
            (If you do type a file extension, it will be ignored).

            (5, 6) Where to send output
            By default, option 6 is set to f, and each entry will be written to
            a separate file, where the name of the file is the name of the
            entry, followed by a file extension. If a complete entry is
            retrieved, the file extension will indicate the type of database
            (GenBank: .gen;  PIR: .pir, Vecbase: .vec; LiMB: .LiMB). If only
            annotation or sequence are retrieved, the file extensions will be
            .ano or .wrp, respectively. Using the default, the namefile above
            would create the following files:

            PUMPR13.gen
            TOBPR1A1.gen
            TOBPR1AR.gen
            TOBPR1CR.gen
            TOBPR1PS.gen

            By choosing option 5, you can specify the name of an output file
            for all entries to go to. This filename will appear at the top
            of the menu. Obviously, options 5 & 6 are mutually exclusive.
            Note entries retrieved are writen in alphabetical order (sorting by
            ASCII values), not the order in which they appeared in namefile.

            (Note for remote users only: -f will only work for a single
            name/accession supplied in 1). -f IS NOT ENABLED FOR NAMEFILES
            specified in 2).)

       Batch mode:
	    Although it is transparent to the user, all fetch really does
	    is call getloc, saving the user the trouble of knowing which
	    database files to retrieve sequences from, or of having to
	    execute getloc multiple times to retrieve sequences from
	    different database files. Thus, the options are identical to those
            for getloc:

            -a    Write annotation portions of entries only, terminated by '//'.
            -s    Write sequence data only, in Pearson (.wrp) format.
            -f    Write each entry to a separate file.
            -g    GenBank (default)
            -e    EMBL  {not implemented}
            -p    PIR (NBRF)
            -v    Vecbase
            -l    LiMB
            -G    GenBank_dataset
            -P    PIR_dataset

            If -f is not specified, outfile must be specified.

            -L    force execution of findkey on local host even if
                  $XYLEM_RHOST is set. See "REMOTE EXECUTION" below


         PIR_dataset
         GenBank_dataset
         This can be either a file of PIR entries, a file of GenBank entries,
         or a XYLEM dataset created by splitdb. A file of PIR entries must
         have the file extension ".pir". A file of GenBank entries must have
         the file extension ".gen". A XYLEM dataset contains PIR entries split
         among three files by splitdb: annotation (.ano), sequence (.wrp)
         and index (.ind). These file extensions must be used!

         When specifying a split dataset, only the base name needs to be
         used. For example given a XYLEM dataset consisting of the files
         myset.ano, myset.wrp and myset.ind, the following two commands
         are equivalent:

         fetch -P myset  something.nam something.pir
         fetch -P myset.ano something.nam something.pir

         If the original .pir file had been used, the command would have
         been

         fetch -P myset.pir something.nam something.pir

         The ability to work directly with .gen or .pir files is quite
         convenient. However, since FETCH needs to work with a split
         FETCH automatically splits .pir or .gen files into .ano, .wrp
         and .ind files, which are removed when finished. This requires
         extra disk space and execution time, which could be significant
         for large datasets.

     EXAMPLES
	Batch example:
	fetch  -f chitinase.nam
            will retrieve annotation and sequence for sequences listed in
            chitinase.nam from GenBank, writing each entry to a separate file
            with the extension .gen.

        fetch -s -v pbr.nam pbr.wrp
            will retrieve sequence data only for the entries listed in pbr.nam,
            from VecBase, and write all sequences to a Pearson format file
            (ie. readable by fasta) with the name pbr.wrp.

        fetch -G sample sample.nam new.gen
        fetch -G sample.ano  sample.nam new.gen
            Assumes that a set of GenBank entries has been split by splitdb
            into sample.ano sample.wrp and sample.ind. The entries listed in
            sample.nam are written to new.gen.


      FILES
        Database files:
          The directories for database files are specified by the environment
          variables $GB (GenBank) $PIR (PIR/NBRF) $VEC(Vecbase) and $LIMB
          (LiMB).

          Index files are $GB/gbacc.idx for GenBank (this file is supplied
          with each GenBank release), while the other databases
          use .ind files generated by splitdb. Split database files MUST
          have the following file extensions: .ano {annotation}, .wrp
          {sequence} and .ind {index}. Thus, when creating database files
          for pir1.dat with splitdb, the output files should be pir1.ano,
          pir1.wrp and pir1.ind.

        Temporary files:
          NAMEFILE.fetch
          PRELIMINARY.fetch
          TMP.fetch
          FOUND.fetch
          FETCHDIR  {temporary directory}

     REMOTE EXECUTION
          Where the databases can not be stored locally, FETCH can call
          FETCH on another system and retrieve the results. To run
          FETCH remotely, your .cshrc file should contain the following
          lines:

          setenv XYLEM_RHOST remotehostname
          setenv XYLEM_USERID remoteuserid

          where remotehostname is the name of the host on which the
          databases reside (in XYLEM split format) and remoteuserid
          is your userid on the remote system. When run remotely,
          your local copy of FETCH will generate the following
          commands:

          rcp filename $XYLEM_USERID@$XYLEM_HOST:filename
          rsh $XYLEM_RHOST -l $XYLEM_USERID fetch ...
          rcp $XYLEM_USERID@$XYLEM_HOST:outputfilename outputfilename
          rsh $XYLEM_RHOST -l $XYLEM_USERID $RM temporary_files

          Because FETCH uses rsh and rcp, your home directory on both
          the local and remote systems must have a world-readable
          file called .rhosts, containing the names of trusted remote
          hosts and your userid on each host. Before trying to get
          FETCH to work remotely, make sure that you can rcp and
          rsh to the remote host.

          Obviously, remote execution of FETCH implies that FETCH
          must already be installed on the remote host. When FETCH
          runs another copy of FETCH remotely, it uses the -L option
          (findkey -L) to insure that the remote FETCH job executes,
          rather than calling yet another FETCH on another host.


          ---------- Remote execution on more than 1 host -----------
          If more than 1 remote host is available for running FINDKEY
          (say, in a clustered environment where many servers mount
          a common filesystem) the choice of a host can be determined
          by the csh script choosehost, such that execution of
          choosehost returns the name of a remote server. To use this
          approach, the following script, called 'choosehost' should
          be in your bin directory:

          #!/bin/csh
          # choosehost - choose a host to use for a remote job.
          # This script rotates among servers listed in .rexhosts,
          # by choosing the host at the top of the list and moving
          # it to the bottom.

          #Rotate the list, putting the current host to the bottom.
          set HOST = `head -1 $home/.rexhosts`
          set JOBID = $$
          tail +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
          echo $HOST >> /tmp/.rexhosts.$JOBID
          /usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts

          # Write out the current host name
          echo $HOST

          You must also have a file in your home directory called
          .rexhosts, listing remote hosts, such as

          graucho.cc.umanitoba.ca
          harpo.cc.umanitoba.ca
          chico.cc.umanitoba.ca
          zeppo.cc.umanitoba.ca

          Each time choosehost is called, choosehost will rotate the
          names in the file. For example, starting with the .rexhosts
          as shown, it will move graucho.cc.umanitoba.ca to the bottom
          of the file, and write the line 'graucho.cc.umanitoba.ca'
          to the standard output. The next time choosehosts is
          run, it would write 'harpo.cc.umanitoba.ca', and so on.

          Depending on your local configuration, you may wish to
          rewrite choosehosts. All that is really necessary is that
          echo `choosehost` should return the name of a valid host.

          Once you have installed choosehost and tested it, you can
          get FINDKEY to use choosehost simply by setting

          setenv XYLEM_RHOST choosehost

          in your .cshrc file.

          ---------------  Remote filesystems -----------------------
          Finally, an alternative to remote execution is to remotely mount
          the file system containing the databases across the network.
          This has the advantage of simplicity, and means that the
          databases are available for ALL programs on your local
          workstation. However, it may still be advantageous to run
          FETCH remotely, since that will shift much of the computational
          load to another host.

    BUGS
          When retrieving entries directly from GenBank, FETCH uses the
          Accession Number index file gbacc.idx. In this case, FETCH
          can retrieve all entries containing a given accession number.
          This capability makes it possible to retrieve an entry using a
          secondary accession number. However if more than one entry
          share a secondary accession number, all of those entries will
          be retrieved. While this behavior might be a bit of an
          annoyance at times, it can also be useful because it alerts
          the user to the presence of other, related entries that might
          be of interest.

    SEE ALSO
            getloc features

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.