

      FINDKEY.DOC                                          update 13 Mar 97


      NAME
            findkey - finds database entries containing one or more keywords

      SYNOPSIS
            findkey
            findkey [-pvbmgrdutielnsaxzShL] keywordfile [namefile findfile]
            findkey [-P PIR_dataset] keywordfile [namefile findfile]
            findkey [-G GenBank_dataset] keywordfile [namefile findfile]

      DESCRIPTION
            findkey uses the grep family of commands to find lines in database
            annotation files containing one or more keywords. Next, identify
            is called to create a .nam file, containing the names of entries
            containing the keywords, and a .fnd file, containing the actual
            lines from each entry containing hits. A PIR or GenBank dataset is
            either a file containing one or more GenBank or PIR entries, or
            the name of a XYLEM dataset created by splitdb. See FILES below
            for a more detailed description.
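
            As a rough illustration of the two steps described above, the
            following Python sketch scans annotation lines for keywords and
            collects entry names (as in the .nam file) and matching lines
            (as in the .fnd file). It is not the findkey implementation,
            which relies on grep and identify. The function name find_hits
            is hypothetical, and the sketch assumes GenBank-style annotation
            in which each entry starts with a LOCUS line and ends with '//'.

```python
def find_hits(annotation_lines, keywords):
    """Return (names, report): names of entries with hits, in order of
    first hit, and a dict mapping each name to its matching lines."""
    lowered = [k.strip().lower() for k in keywords if k.strip()]
    names = []
    report = {}
    current = None  # name of the entry being scanned, if any
    for line in annotation_lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            current = fields[1] if len(fields) > 1 else None
        elif line.startswith("//"):
            current = None  # end of entry (assumed terminator)
            continue
        if current is None:
            continue
        text = line.lower()  # keyword searches are case insensitive
        if any(k in text for k in lowered):
            if current not in report:
                names.append(current)
                report[current] = []
            report[current].append(line.rstrip("\n"))
    return names, report


# Made-up sample data, for illustration only.
sample = [
    "LOCUS       ABC123       1200 bp    DNA\n",
    "DEFINITION  Plasmid gene for Kanamycin resistance.\n",
    "//\n",
    "LOCUS       XYZ789        800 bp    DNA\n",
    "DEFINITION  Hypothetical protein.\n",
    "//\n",
]
names, report = find_hits(sample, ["kanamycin", "neomycin"])
print(names)
print(report)
```

            Here names plays the role of the .nam file and report the role
            of the .fnd file.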

      INTERACTIVE USE
      findkey prompts the user to set search parameters, using an interactive
      menu:

      ___________________________________________________________________
                     FINDKEY - Version  12 Aug 94
          Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
      ___________________________________________________________________
      Keyfile:
      Dataset:
      -------------------------------------------------------------------
         Parameter              Description                      Value
      -------------------------------------------------------------------
      1) Keyword     Keyword to find                               thionin
      2) Keyfile     Get list of keywords from Keyfile
      3) WhereToLook p:PIR   v:VecBase                                   p
                     GenBank - b:bacterial     i:invertebrate
                               m:mamalian      e:expressed seq. tag
                               g:phage         l:plant
                               r:primate       n:rna
                               d:rodent        s:synthetic
                               u:unannotated   a:viral
                               t:vertebrate    x:patented
                               z:STS
                     G: GenBank dataset        P: PIR dataset
         -------------------------------------------------------------
         Type number of your choice or 0 to continue:
      0
      Searching /home/psgendb/PIR/pir1.ano...
      Sequence names will be written to thionin~pir.nam
      Lines containing keyword(s) will be written to thionin~pir.fnd
      Searching /home/psgendb/PIR/pir2.ano...
      Sequence names will be written to thionin~pir.nam
      Lines containing keyword(s) will be written to thionin~pir.fnd
      Searching /home/psgendb/PIR/pir3.ano...
      Sequence names will be written to thionin~pir.nam
      Lines containing keyword(s) will be written to thionin~pir.fnd

      In the example above, thionin was specified as the keyword to
      search for. By default, option 3 is set to p, and the PIR protein
      database is searched. Messages describe the progress of the
      search. Since PIR is broken up into several files (pir1, pir2 and
      pir3 in this example), each is searched in turn, but all output is
      written to thionin~pir.nam and thionin~pir.fnd.

      OPTIONS
            (1,2) Which keywords to search for?
            If you want to search for a single keyword, option 1 lets you type
            the keyword, without having to create a file. To search for more
            than one keyword, choose option 2, and specify the name of a
            file containing the keywords. For example, entries containing
            genes for antibiotic resistance might be found using the
            following keyword file:

            ampicillin
            chloramphenicol
            kanamycin
            neomycin
            tetracycline

            Note: keyword searches are case insensitive.

            As you might expect, it takes longer to search for multiple
            keywords than a single keyword.

            Options 1 & 2 are mutually exclusive. Setting one will negate the
            other. If option 2 is chosen, the name of the keyword file will
            appear at the top of the menu.

            Finally, it is probably not a good idea to search GenBank
            entries using very short keywords consisting only of letters.
            This is because GenBank entries now include a /translation
            field containing the amino acid sequence of each protein
            coding sequence. Consequently, 3 or 4 letter keywords
            consisting of legal amino acid symbols (e.g. CAP, recA) will
            turn up fairly often in protein translations.
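
            The caution about short keywords can be checked mechanically
            before a search. The sketch below is only an illustration: the
            function names are hypothetical, the keyword file is assumed to
            hold one keyword per line, and "short" is taken to mean at most
            4 letters drawn from the 20 standard one-letter amino acid
            codes.

```python
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues


def read_keywords(text):
    """Split keyword-file text into keywords, one per line, skipping blanks."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def risky_for_genbank(keyword, max_len=4):
    """True if a short keyword uses only amino acid letters (case ignored),
    so it may match /translation fields by chance."""
    letters = keyword.upper()
    return 0 < len(letters) <= max_len and all(
        c in AMINO_ACID_LETTERS for c in letters
    )


for kw in read_keywords("CAP\nrecA\nthionin\n"):
    print(kw, risky_for_genbank(kw))
```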

            (3) WhereToLook
            Use this option to specify the database to be searched. In the
            case of GenBank, only one division at a time may be searched.
            User-created database subsets containing PIR (P) or GenBank (G)
            entries may also be searched. User-created database subsets
            must be in the .ano/.wrp/.ind form created by splitdb.

      OUTPUT
         The output filenames take the following form:

         name~ex1.ex2

         The 'name' part of the filename is either the keyword searched for,
         if option 1 was chosen, or the name of the keyword file, if option 2
         was chosen. 'ex1' indicates the database division that was searched.
         For PIR and VecBase, ex1 is 'pir' and 'vec', respectively. For
         GenBank, ex1 is as follows:

            bct - bacterial
            inv - invertebrate
            mam - other mammalian
            est - expressed sequence tag
            phg - phage
            pln - plant (includes fungi)
            pri - primate
            rna - structural RNAs
            rod - rodent
            syn - synthetic sequences
            sts - sequence tagged sites
            una - unannotated (new) sequences
            vrl - viral
            vrt - other vertebrate

         'ex2' distinguishes the files containing the names of entries
         containing keywords (.nam) and the files containing the lines found
         in each entry (.fnd).

         The .nam file can be used directly as a namefile for fetch, getloc,
         or getob.
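
         The naming convention can be sketched as a small Python helper.
         This is an illustration only: output_names is a hypothetical
         function, the '~' separator is taken from the filenames printed in
         the interactive example (thionin~pir.nam), and the keyword file
         extension is assumed to be dropped, as in the antibiotic.kw
         example.

```python
import os


def output_names(keyword=None, keyfile=None, ex1="pir"):
    """Return (namefile, findfile) for a single keyword or a keyword file."""
    if (keyword is None) == (keyfile is None):
        raise ValueError("give exactly one of keyword or keyfile")
    if keyword is not None:
        name = keyword
    else:
        # Use the keyword file's base name without directory or extension.
        name = os.path.splitext(os.path.basename(keyfile))[0]
    return f"{name}~{ex1}.nam", f"{name}~{ex1}.fnd"


print(output_names(keyword="thionin"))
print(output_names(keyfile="antibiotic.kw", ex1="vec"))
```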

     COMMAND LINE USE

         OPTIONS
              p                  search PIR (default)
              P PIR dataset      search dbfile, containing PIR entries
              v                  search VecBase
              b                  search GenBank bacterial      division
              m                  search GenBank mammalian      division
              g                  search GenBank phage          division
              r                  search GenBank primate        division
              d                  search GenBank rodent         division
              u                  search GenBank unannotated    division
              t                  search GenBank vertebrate     division
              i                  search GenBank invertebrate   division
              l                  search GenBank plant          division
              n                  search GenBank rna            division
              s                  search GenBank synthetic      division
              a                  search GenBank viral          division
              x                  search GenBank patented       division
              e                  search GenBank exp.seq.tag    division
              z                  search GenBank STS            division
              S                  search GenBank Genom. Survey  division
              h                  search GenBank High Thrput.   division
              G GenBank dataset  search dbfile, containing GenBank entries

              L                  force execution of findkey on local host
                                 even if $XYLEM_RHOST is set. See "REMOTE
                                 EXECUTION" below

         FILES

         keywordfile - contains keywords to search for

         namefile - LOCUS names of hits are written to this file

         findfile - for each hit, a report listing the LOCUS name and the
            lines matching the keyword is written to this file.

         If namefile and findfile are not specified on the command line,
         filenames will be created as described above for interactive
         use.

         PIR_dataset
         GenBank_dataset
         This can be either a file of PIR entries, a file of GenBank entries,
         or a XYLEM dataset created by splitdb. A file of PIR entries must
         have the file extension ".pir". A file of GenBank entries must have
         the file extension ".gen". A XYLEM dataset contains PIR or GenBank
         entries split among three files by splitdb: annotation (.ano),
         sequence (.wrp) and index (.ind). These file extensions must be used!

         When specifying a split dataset, only the base name needs to be
         used. For example, given a XYLEM dataset consisting of the files
         myset.ano, myset.wrp and myset.ind, the following two commands
         are equivalent:

         findkey -P myset  something.kw
         findkey -P myset.ano something.kw

         If the original .pir file had been used, the command would have
         been

         findkey -P myset.pir something.kw

         The ability to work directly with .gen or .pir files is quite
         convenient. However, since FINDKEY needs to work with a split
         dataset, it automatically splits .pir or .gen files into .ano,
         .wrp and .ind files, which are removed when finished. This
         requires extra disk space and execution time, which could be
         significant for large datasets.
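
         The dataset naming rules above can be summarized in a short Python
         sketch. The function names are hypothetical, and the logic is an
         assumption based only on the extensions described here, not a
         transcription of how FINDKEY itself decides.

```python
import os


def classify_dataset(arg):
    """Return (kind, base): kind is 'pir_file', 'genbank_file' or 'split'."""
    base, ext = os.path.splitext(arg)
    if ext == ".pir":
        return "pir_file", base      # flat file; must be split before searching
    if ext == ".gen":
        return "genbank_file", base  # flat file; must be split before searching
    if ext == ".ano":
        return "split", base         # myset.ano is equivalent to myset
    return "split", arg              # bare base name of a split dataset


def split_files(base):
    """Names of the three files making up a split XYLEM dataset."""
    return [base + ext for ext in (".ano", ".wrp", ".ind")]


for arg in ("myset", "myset.ano", "myset.pir"):
    print(arg, classify_dataset(arg))
```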

     EXAMPLES
         If the list of antibiotics shown above was stored in the file
         antibiotic.kw, and option 3 was set to 'b', then the annotation
         portion of the GenBank bacterial division would be searched, and
         all lines containing any of these keywords would be written to
         antibiotic~bct.fnd. The corresponding GenBank entry names would
         appear in antibiotic~bct.nam.

         The same keyword file could be used to search other database files.
         If VecBase was searched, the output files would be antibiotic~vec.fnd
         and antibiotic~vec.nam. These filename conventions make it easy
         to search different database divisions, and to keep track of where
         data came from.

         Command line examples:

         findkey thionin.kw

         would be equivalent to the interactive example shown above. In
         this case, the file thionin.kw contains the word 'thionin'.
         (Note that since PIR is the default, -p need not be supplied.)

         findkey -b antibiotic.kw drugs.nam drugs.fnd

         would search the GenBank bacterial division for the keywords
         contained in antibiotic.kw, and write the output to drugs.nam
         and drugs.fnd.

     FILES
        Database files:
          The directories for database files are specified by the environment
          variables $GB (GenBank), $PIR (PIR/NBRF) and $VEC (VecBase).
          Annotation (.ano) and index (.ind) files are those generated by
          splitdb.

        Temporary files:
          $jobid.fnd
          $jobid.nam
          $jobid.grep

          where $jobid is a unique job ID generated by the shell.

     REMOTE EXECUTION
          Where the databases cannot be stored locally, FINDKEY can call
          FINDKEY on another system and retrieve the results. To run
          FINDKEY remotely, your .cshrc file should contain the following
          lines:

          setenv XYLEM_RHOST remotehostname
          setenv XYLEM_USERID remoteuserid

          where remotehostname is the name of the host on which the
          databases reside (in XYLEM split format) and remoteuserid
          is your userid on the remote system.  When run remotely,
          your local copy of FINDKEY will generate the following
          commands:

          rcp filename $XYLEM_USERID@$XYLEM_RHOST:filename
          rsh $XYLEM_RHOST -l $XYLEM_USERID findkey ...
          rcp $XYLEM_USERID@$XYLEM_RHOST:outputfilename outputfilename
          rsh $XYLEM_RHOST -l $XYLEM_USERID rm temporary_files

          Because FINDKEY uses rsh and rcp, your home directory on both
          the local and remote systems must have a world-readable
          file called .rhosts, containing the names of trusted remote
          hosts and your userid on each host. Before trying to get
          FINDKEY to work remotely, make sure that you can rcp and
          rsh to the remote host.

          Obviously, remote execution of FINDKEY implies that FINDKEY
          must already be installed on the remote host. When FINDKEY
          runs another copy of FINDKEY remotely, it uses the -L option
          (findkey -L) to ensure that the remote FINDKEY job executes,
          rather than calling yet another FINDKEY on another host.
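
          The sequence of remote commands can be sketched in Python as a
          function that only builds the command lists and runs nothing.
          This is an illustration under assumptions: remote_commands is a
          hypothetical name, the arguments passed to the remote findkey are
          left to the caller because the text above does not spell them
          out, and the files removed at the end are assumed to be the
          copied keyword file and the output files.

```python
def remote_commands(rhost, userid, keyfile, outputs, findkey_args):
    """Build the rcp/rsh command lists for one remote findkey run."""
    remote = f"{userid}@{rhost}"
    cmds = [["rcp", keyfile, f"{remote}:{keyfile}"]]
    # -L makes the remote copy run locally instead of calling another host.
    cmds.append(["rsh", rhost, "-l", userid, "findkey", "-L", *findkey_args])
    for out in outputs:
        cmds.append(["rcp", f"{remote}:{out}", out])
    # Assumed cleanup: remove the files left behind on the remote host.
    cmds.append(["rsh", rhost, "-l", userid, "rm", keyfile, *outputs])
    return cmds


# Host and userid below are placeholders.
for cmd in remote_commands(
    "remote.example.org", "me", "antibiotic.kw",
    ["drugs.nam", "drugs.fnd"],
    ["-b", "antibiotic.kw", "drugs.nam", "drugs.fnd"],
):
    print(" ".join(cmd))
```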

          ---------- Remote execution on more than 1 host -----------
          If more than 1 remote host is available for running FINDKEY
          (say, in a clustered environment where many servers mount
          a common filesystem) the choice of a host can be determined
          by the csh script choosehost, such that execution of
          choosehost returns the name of a remote server. To use this
          approach, the following script, called 'choosehost' should
          be in your bin directory:

          #!/bin/csh
          # choosehost - choose a host to use for a remote job.
          # This script rotates among servers listed in .rexhosts,
          # by choosing the host at the top of the list and moving
          # it to the bottom.

          # Rotate the list, moving the current host to the bottom.
          set HOST = `head -n 1 $home/.rexhosts`
          set JOBID = $$
          # Copy every line except the first to a temporary file.
          tail -n +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
          echo $HOST >> /tmp/.rexhosts.$JOBID
          mv /tmp/.rexhosts.$JOBID $home/.rexhosts

          # Write out the current host name
          echo $HOST

          You must also have a file in your home directory called
          .rexhosts, listing remote hosts, such as

          graucho.cc.umanitoba.ca
          harpo.cc.umanitoba.ca
          chico.cc.umanitoba.ca
          zeppo.cc.umanitoba.ca

          Each time choosehost is called, choosehost will rotate the
          names in the file. For example, starting with the .rexhosts
          as shown, it will move graucho.cc.umanitoba.ca to the bottom
          of the file, and write the line 'graucho.cc.umanitoba.ca'
          to the standard output. The next time choosehost is
          run, it would write 'harpo.cc.umanitoba.ca', and so on.
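
          The rotation itself is simple enough to express in a few lines of
          Python. This sketch works on an in-memory list rather than on the
          .rexhosts file, and rotate_hosts is a hypothetical name, not part
          of XYLEM.

```python
def rotate_hosts(hosts):
    """Return (chosen, new_order): the first host is chosen and moved
    to the end of the list."""
    if not hosts:
        raise ValueError("host list is empty")
    chosen = hosts[0]
    return chosen, hosts[1:] + [chosen]


hosts = [
    "graucho.cc.umanitoba.ca",
    "harpo.cc.umanitoba.ca",
    "chico.cc.umanitoba.ca",
    "zeppo.cc.umanitoba.ca",
]
first, hosts = rotate_hosts(hosts)
second, hosts = rotate_hosts(hosts)
print(first)
print(second)
```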

          Depending on your local configuration, you may wish to
          rewrite choosehost. All that is really necessary is that
          echo `choosehost` should return the name of a valid host.

          Once you have installed choosehost and tested it, you can
          get FINDKEY to use choosehost simply by setting

          setenv XYLEM_RHOST choosehost

          in your .cshrc file.

          ---------------  Remote filesystems -----------------------
          Finally, an alternative to remote execution is to remotely mount
          the file system containing the databases across the network.
          This has the advantage of simplicity, and means that the
          databases are available for ALL programs on your local
          workstation. However, it may still be advantageous to run
          XYLEM remotely, since that will shift much of the computational
          load to another host.


     BUGS
         At present, regular expression characters cannot be used for
         keyword searches.
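
         Given this limitation, it may help to screen a keyword list before
         searching. The Python sketch below is a hypothetical helper, not
         part of findkey, and the set of characters treated as regular
         expression characters is an assumption covering the usual grep and
         egrep metacharacters.

```python
# Assumed set of regular expression metacharacters.
REGEX_CHARS = set(r".*[]^$\+?|(){}")


def has_regex_chars(keyword):
    """True if the keyword contains any assumed regex metacharacter."""
    return any(c in REGEX_CHARS for c in keyword)


for kw in ("thionin", "rec[AB]", "heat.shock"):
    print(kw, has_regex_chars(kw))
```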

      SEE ALSO
            grep(1V) identify splitdb

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.