FINDKEY.DOC                                         update 13 Mar 97

NAME
      findkey - finds database entries containing one or more keywords

SYNOPSIS
      findkey
      findkey [-pvbmgrdutielnsaxzL] keywordfile [namefile findfile]
      findkey [-P PIR_dataset] keywordfile [namefile findfile]
      findkey [-G GenBank_dataset] keywordfile [namefile findfile]

DESCRIPTION
      findkey uses the grep family of commands to find lines in
      database annotation files that contain one or more keywords.
      Next, identify is called to create a .nam file, containing the
      names of the entries in which keywords were found, and a .fnd
      file, containing the matching lines from each of those entries.

      A PIR or GenBank dataset is either a file containing one or
      more GenBank or PIR entries, or the name of a XYLEM dataset
      created by splitdb. See FILES below for a more detailed
      description.

INTERACTIVE USE
      findkey prompts the user to set search parameters, using an
      interactive menu:

 ___________________________________________________________________
 FINDKEY - Version 12 Aug 94
 Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
 ___________________________________________________________________
 Keyfile:
 Dataset:
 -------------------------------------------------------------------
   Parameter      Description                           Value
 -------------------------------------------------------------------
  1) Keyword      Keyword to find                       thionin
  2) Keyfile      Get list of keywords from Keyfile
  3) WhereToLook  p:PIR   v:VecBase                     p
     GenBank - b:bacterial      i:invertebrate
               m:mamalian       e:expressed seq. tag
               g:phage          l:plant
               r:primate        n:rna
               d:rodent         s:synthetic
               u:unannotated    a:viral
               t:vertebrate     x:patented
               z:STS
     G: GenBank dataset
     P: PIR dataset
 -------------------------------------------------------------
 Type number of your choice or 0 to continue:
 0
 Searching /home/psgendb/PIR/pir1.ano...
 Sequence names will be written to thionin~pir.nam
 Lines containing keyword(s) will be written to thionin~pir.fnd
 Searching /home/psgendb/PIR/pir2.ano...
 Sequence names will be written to thionin~pir.nam
 Lines containing keyword(s) will be written to thionin~pir.fnd
 Searching /home/psgendb/PIR/pir3.ano...
 Sequence names will be written to thionin~pir.nam
 Lines containing keyword(s) will be written to thionin~pir.fnd

      In the example above, thionin was specified as the keyword to
      search for. By default, option 3 is set to p, and the PIR
      protein database is searched. Messages describe the progress of
      the search. Since PIR is broken up into more than one file
      (pir1, pir2 and pir3 in this example), each is searched in turn,
      but all output is written to thionin~pir.nam and
      thionin~pir.fnd.

   OPTIONS
      (1,2) Which keywords to search for?
      If you want to search for a single keyword, option 1 lets you
      type the keyword, without having to create a file. To search
      for more than one keyword, choose option 2, and specify the
      name of a file containing the keywords. For example, entries
      containing genes for antibiotic resistance might be found using
      the following keyword file:

            ampicillin
            chloramphenicol
            kanamycin
            neomycin
            tetracycline

      Note: keyword searches are case insensitive. As you might
      expect, it takes longer to search for multiple keywords than
      for a single keyword.

      Options 1 and 2 are mutually exclusive. Setting one negates the
      other. If option 2 is chosen, the name of the keyword file
      appears at the top of the menu.

      Finally, avoid searching GenBank entries with very short
      keywords consisting only of letters. GenBank entries now
      include a /translation field containing the amino acid sequence
      of each protein coding sequence. Consequently, 3 or 4 letter
      keywords consisting of legal amino acid symbols (e.g. CAP,
      recA) will turn up fairly often in protein translations.

      (3) WhereToLook
      Use this option to specify the database to be searched. In the
      case of GenBank, only one division at a time may be searched.
      User-created database subsets containing PIR (P) or GenBank (G)
      entries may also be searched.
      User-created database subsets must be in the .ano/.wrp/.ind
      form created by splitdb.

   OUTPUT
      The output filenames take the following form:

            name~ex1.ex2

      The 'name' part of the filename is either the keyword searched
      for, if option 1 was chosen, or the name of the keyword file
      (without its extension), if option 2 was chosen. 'ex1'
      indicates the database division that was searched. For PIR and
      VecBase, ex1 is 'pir' and 'vec', respectively. For GenBank, ex1
      is as follows:

            bct - bacterial
            inv - invertebrate
            mam - other mammalian
            est - expressed sequence tag
            phg - phage
            pln - plant (includes fungi)
            pri - primate
            rna - structural RNAs
            rod - rodent
            syn - synthetic sequences
            sts - sequence tagged sites
            una - unannotated (new) sequences
            vrl - viral
            vrt - other vertebrate

      'ex2' distinguishes the file containing the names of entries in
      which keywords were found (.nam) from the file containing the
      lines found in each entry (.fnd). The .nam file can be used
      directly as a namefile for fetch, getloc, or getob.

COMMAND LINE USE
   OPTIONS
      p  search PIR (default)
      P  PIR_dataset
         search PIR_dataset, containing PIR entries
      v  search VecBase
      b  search GenBank bacterial division
      m  search GenBank mammalian division
      g  search GenBank phage division
      r  search GenBank primate division
      d  search GenBank rodent division
      u  search GenBank unannotated division
      t  search GenBank vertebrate division
      i  search GenBank invertebrate division
      l  search GenBank plant division
      n  search GenBank rna division
      s  search GenBank synthetic division
      a  search GenBank viral division
      x  search GenBank patented division
      e  search GenBank expressed sequence tag division
      z  search GenBank STS division
      S  search GenBank genome survey division
      h  search GenBank high-throughput division
      G  GenBank_dataset
         search GenBank_dataset, containing GenBank entries
      L  force execution of findkey on local host even if
         $XYLEM_RHOST is set.
         See "REMOTE EXECUTION" below.

   FILES
      keywordfile - contains keywords to search for
      namefile    - LOCUS names of hits are written to this file
      findfile    - for each hit, a report listing the LOCUS name and
                    the lines matching the keyword is written to this
                    file.

      If namefile and findfile are not specified on the command line,
      filenames will be created as described above for interactive
      use.

      PIR_dataset
      GenBank_dataset
         This can be either a file of PIR entries, a file of GenBank
         entries, or a XYLEM dataset created by splitdb. A file of
         PIR entries must have the file extension ".pir". A file of
         GenBank entries must have the file extension ".gen". A XYLEM
         dataset contains PIR or GenBank entries split among three
         files by splitdb: annotation (.ano), sequence (.wrp) and
         index (.ind). These file extensions must be used!

         When specifying a split dataset, only the base name needs to
         be used. For example, given a XYLEM dataset consisting of
         the files myset.ano, myset.wrp and myset.ind, the following
         two commands are equivalent:

               findkey -P myset something.kw
               findkey -P myset.ano something.kw

         If the original .pir file had been used, the command would
         have been

               findkey -P myset.pir something.kw

         The ability to work directly with .gen or .pir files is
         quite convenient. However, since FINDKEY needs to work with
         a split dataset, FINDKEY automatically splits .pir or .gen
         files into .ano, .wrp and .ind files, which are removed when
         the search is finished. This requires extra disk space and
         execution time, which could be significant for large
         datasets.

EXAMPLES
      If the list of antibiotics shown above was stored in the file
      antibiotic.kw, and option 3 was set to 'b', then the annotation
      portion of the GenBank bacterial division would be searched,
      and all lines containing any of these keywords would be written
      to antibiotic~bac.fnd. The corresponding GenBank entry names
      would appear in antibiotic~bac.nam. The same keyword file could
      be used to search other database files.
      If VecBase was searched, the output files would be
      antibiotic~vec.fnd and antibiotic~vec.nam. These filename
      conventions make it easy to search different database
      divisions, and to keep track of where data came from.

      Command line examples:

            findkey thionin.kw

      would be equivalent to the interactive example shown above. In
      this case, the file thionin.kw contains the word 'thionin'.
      (Note that since PIR is the default, -p need not be supplied.)

            findkey -b antibiotic.kw drugs.nam drugs.fnd

      would search the GenBank bacterial division for the keywords
      contained in antibiotic.kw, and write the output to drugs.nam
      and drugs.fnd.

FILES
      Database files:
      The directories for database files are specified by the
      environment variables $GB (GenBank), $PIR (PIR/NBRF) and $VEC
      (VecBase). Annotation (.ano) and index (.ind) files are those
      generated by splitdb.

      Temporary files:
            $jobid.fnd
            $jobid.nam
            $jobid.grep
      where $jobid is a unique jobid generated by the shell.

REMOTE EXECUTION
      Where the databases cannot be stored locally, FINDKEY can call
      FINDKEY on another system and retrieve the results. To run
      FINDKEY remotely, your .cshrc file should contain the following
      lines:

            setenv XYLEM_RHOST remotehostname
            setenv XYLEM_USERID remoteuserid

      where remotehostname is the name of the host on which the
      databases reside (in XYLEM split format) and remoteuserid is
      your userid on the remote system. When run remotely, your local
      copy of FINDKEY will generate the following commands:

            rcp filename $XYLEM_USERID@$XYLEM_RHOST:filename
            rsh $XYLEM_RHOST -l $XYLEM_USERID findkey ...
            rcp $XYLEM_USERID@$XYLEM_RHOST:outputfilename outputfilename
            rsh $XYLEM_RHOST -l $XYLEM_USERID rm temporary_files

      Because FINDKEY uses rsh and rcp, your home directory on both
      the local and remote systems must have a world-readable file
      called .rhosts, containing the names of trusted remote hosts
      and your userid on each host. Before trying to get FINDKEY to
      work remotely, make sure that you can rcp and rsh to the remote
      host.
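
      The command sequence above can be sketched as a dry run that
      only prints what would be executed. This is an illustration in
      POSIX sh, not FINDKEY's own code: the function name
      print_remote_plan, its three arguments and the placeholder host
      and userid are assumptions, and the arguments that FINDKEY
      really passes to the remote findkey (shown as '...' above) are
      not specified in this document.

```sh
#!/bin/sh
# Dry-run sketch: print (do not run) a command sequence of the kind
# described for a remote FINDKEY job.
# Assumptions, not taken from FINDKEY itself: the function name, its
# arguments, and the fallback values used when the variables are unset.

XYLEM_RHOST=${XYLEM_RHOST:-remotehostname}
XYLEM_USERID=${XYLEM_USERID:-remoteuserid}

# print_remote_plan KEYWORDFILE NAMEFILE FINDFILE
print_remote_plan() {
    kwfile=$1
    namefile=$2
    findfile=$3
    remote="$XYLEM_USERID@$XYLEM_RHOST"

    # 1. copy the keyword file to the remote host
    echo "rcp $kwfile $remote:$kwfile"
    # 2. run findkey there; -L keeps that job on the remote host itself
    echo "rsh $XYLEM_RHOST -l $XYLEM_USERID findkey -L $kwfile $namefile $findfile"
    # 3. copy both output files back
    echo "rcp $remote:$namefile $namefile"
    echo "rcp $remote:$findfile $findfile"
    # 4. remove the files left on the remote host
    echo "rsh $XYLEM_RHOST -l $XYLEM_USERID rm $kwfile $namefile $findfile"
}

print_remote_plan thionin.kw thionin.nam thionin.fnd
```

      Setting XYLEM_RHOST and XYLEM_USERID in the environment before
      running the sketch substitutes real values for the placeholders.
      Nothing is copied or executed remotely; the lines are only
      echoed.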
      Remote execution of FINDKEY requires that FINDKEY already be
      installed on the remote host. When FINDKEY runs another copy of
      FINDKEY remotely, it uses the -L option (findkey -L) to ensure
      that the remote FINDKEY job executes on that host, rather than
      calling yet another FINDKEY on another host.

      ---------- Remote execution on more than 1 host -----------

      If more than one remote host is available for running FINDKEY
      (say, in a clustered environment where many servers mount a
      common filesystem), the choice of a host can be made by the csh
      script choosehost, such that execution of choosehost returns
      the name of a remote server. To use this approach, the
      following script, called 'choosehost', should be in your bin
      directory:

            #!/bin/csh
            # choosehost - choose a host to use for a remote job.
            # This script rotates among servers listed in .rexhosts,
            # by choosing the host at the top of the list and moving
            # it to the bottom.

            # Rotate the list, putting the current host at the bottom.
            set HOST = `head -n 1 $home/.rexhosts`
            set JOBID = $$
            # Copy every line except the first to a temporary file
            tail -n +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
            echo $HOST >> /tmp/.rexhosts.$JOBID
            /usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts

            # Write out the current host name
            echo $HOST

      You must also have a file in your home directory called
      .rexhosts, listing remote hosts, such as

            graucho.cc.umanitoba.ca
            harpo.cc.umanitoba.ca
            chico.cc.umanitoba.ca
            zeppo.cc.umanitoba.ca

      Each time choosehost is called, it rotates the names in the
      file. For example, starting with the .rexhosts file as shown,
      it will move graucho.cc.umanitoba.ca to the bottom of the file,
      and write the line 'graucho.cc.umanitoba.ca' to the standard
      output. The next time choosehost is run, it would write
      'harpo.cc.umanitoba.ca', and so on.

      Depending on your local configuration, you may wish to rewrite
      choosehost. All that is really necessary is that

            echo `choosehost`

      should return the name of a valid host.
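
      The rotation can be tried without touching a real .rexhosts
      file. The following sketch re-implements the same steps in
      POSIX sh (not csh) on a scratch file; the function name
      rotate_hosts and the use of mktemp are assumptions made for the
      illustration.

```sh
#!/bin/sh
# Sketch of the rotation performed by choosehost, run against a
# scratch file so that no real ~/.rexhosts is modified.
# Assumption: rotate_hosts and its file argument are illustrative names.

# rotate_hosts FILE: print the first host, then move it to the end of FILE
rotate_hosts() {
    list=$1
    host=$(head -n 1 "$list")
    tmp="$list.$$"
    tail -n +2 "$list" > "$tmp"    # every line except the first
    echo "$host" >> "$tmp"         # the old first line goes last
    mv "$tmp" "$list"
    echo "$host"
}

# Build a scratch host list (mktemp is assumed to be available)
scratch=$(mktemp)
printf '%s\n' graucho.cc.umanitoba.ca harpo.cc.umanitoba.ca \
              chico.cc.umanitoba.ca zeppo.cc.umanitoba.ca > "$scratch"

rotate_hosts "$scratch"    # first call
rotate_hosts "$scratch"    # second call
rm -f "$scratch"
```

      With the four hosts listed above, the two calls print
      graucho.cc.umanitoba.ca and then harpo.cc.umanitoba.ca, which
      matches the behaviour described for choosehost.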
      Once you have installed choosehost and tested it, you can get
      FINDKEY to use choosehost simply by setting

            setenv XYLEM_RHOST choosehost

      in your .cshrc file.

      --------------- Remote filesystems -----------------------

      Finally, an alternative to remote execution is to remotely
      mount the file system containing the databases across the
      network. This has the advantage of simplicity, and means that
      the databases are available for ALL programs on your local
      workstation. However, it may still be advantageous to run XYLEM
      remotely, since that will shift much of the computational load
      to another host.

BUGS
      At present, regular expression characters cannot be used for
      keyword searches.

SEE ALSO
      grep(1V) identify splitdb

AUTHOR
      Dr. Brian Fristensky
      Dept. of Plant Science
      University of Manitoba
      Winnipeg, MB Canada R3T 2N2
      Phone: 204-474-6085
      FAX: 204-261-5732
      frist@cc.umanitoba.ca

REFERENCE
      Fristensky, B. (1993) Feature expressions: creating and
      manipulating sequence datasets. Nucleic Acids Research
      21:5997-6003.