FINDKEY.DOC update 13 Mar 97
NAME
findkey - finds database entries containing one or more keywords
SYNOPSIS
findkey
findkey [-pvbmgrdutielnsaxzShL] keywordfile [namefile findfile]
findkey [-P PIR_dataset] keywordfile [namefile findfile]
findkey [-G GenBank_dataset] keywordfile [namefile findfile]
DESCRIPTION
findkey uses the grep family of commands to find lines in database
annotation files containing one or more keywords. Next, identify
is called to create a .nam file, containing the names of entries
containing the keywords, and a .fnd file, containing the actual
lines from each entry containing hits. A PIR or GenBank dataset is
either a file containing one or more GenBank or PIR entries, or
the name of a XYLEM dataset created by splitdb. See FILES below
for a more detailed description.
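The two steps described above (find matching lines, then work out which
entries they belong to) can be sketched in a few lines of POSIX sh. This
is only a rough stand-in, not the findkey code itself: it assumes a plain
GenBank-style flat file in which each entry starts with a LOCUS line, and
the function name hits_by_entry is made up for illustration. The real
program works on splitdb .ano and .ind files and calls identify.

```shell
#!/bin/sh
# hits_by_entry KEYWORDFILE FLATFILE
# Prints "LOCUSNAME<TAB>matching line" for every line that contains any
# keyword from KEYWORDFILE (case-insensitive, fixed strings only).
# Illustrative stand-in for the grep + identify steps; not findkey itself.
hits_by_entry() {
    awk -v kwfile="$1" '
        BEGIN {
            # Load the keywords, lower-cased, skipping blank lines.
            while ((getline kw < kwfile) > 0)
                if (kw != "") keys[tolower(kw)] = 1
        }
        $1 == "LOCUS" { name = $2; next }   # remember the current entry
        {
            line = tolower($0)
            for (k in keys)
                if (index(line, k) > 0) { print name "\t" $0; break }
        }
    ' "$2"
}
```

Piping the result through cut -f1 | sort -u gives a list of entry names,
roughly what the .nam file holds; the full output is a rough analogue of
the .fnd file.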
INTERACTIVE USE
findkey prompts the user to set search parameters, using an interactive
menu:
___________________________________________________________________
FINDKEY - Version 12 Aug 94
Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
___________________________________________________________________
Keyfile:
Dataset:
-------------------------------------------------------------------
Parameter       Description                          Value
-------------------------------------------------------------------
1) Keyword      Keyword to find                      thionin
2) Keyfile      Get list of keywords from Keyfile
3) WhereToLook  p:PIR        v:VecBase               p
                GenBank - b:bacterial    i:invertebrate
                          m:mamalian     e:expressed seq. tag
                          g:phage        l:plant
                          r:primate      n:rna
                          d:rodent       s:synthetic
                          u:unannotated  a:viral
                          t:vertebrate   x:patented
                          z:STS
                G: GenBank dataset   P: PIR dataset
-------------------------------------------------------------
Type number of your choice or 0 to continue:
0
Searching /home/psgendb/PIR/pir1.ano...
Sequence names will be written to thionin~pir.nam
Lines containing keyword(s) will be written to thionin~pir.fnd
Searching /home/psgendb/PIR/pir2.ano...
Sequence names will be written to thionin~pir.nam
Lines containing keyword(s) will be written to thionin~pir.fnd
Searching /home/psgendb/PIR/pir3.ano...
Sequence names will be written to thionin~pir.nam
Lines containing keyword(s) will be written to thionin~pir.fnd
In the example above, thionin was specified as the keyword to
search for. By default, option 3 is set to p, and the PIR protein
database is searched. Messages describe the progress of the search.
Since PIR is broken up into several files (pir1, pir2 and pir3 in
this example), each is searched in turn, but all output is written
to thionin~pir.nam and thionin~pir.fnd.
OPTIONS
(1,2) Which keywords to search for?
If you want to search for a single keyword, option 1 lets you type
the keyword, without having to create a file. To search for more
than one keyword, choose option 2, and specify the name of a
file containing the keywords. For example, entries containing
genes for antibiotic resistance might be found using the
following keyword file:
ampicillin
chloramphenicol
kanamycin
neomycin
tetracycline
Note: keyword searches are case insensitive.
As you might expect, it takes longer to search for multiple
keywords than a single keyword.
Options 1 & 2 are mutually exclusive. Setting one will negate the
other. If option 2 is chosen, the name of the keyword file will
appear at the top of the menu.
Finally, it is probably not a good idea to search GenBank
entries using very short keywords consisting only of letters.
This is because GenBank entries now include a /translation
field containing the amino acid sequence of each protein
coding sequence. Consequently, 3 or 4 letter keywords
consisting of legal amino acid symbols (e.g. CAP, recA) will
turn up fairly often in protein translations.
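The effect is easy to reproduce on a GenBank-style flat file. The sketch
below is not part of findkey; outside_translation is a hypothetical helper
that repeats a single-keyword search while skipping lines belonging to a
/translation qualifier, assuming the qualifier opens with /translation="
and runs until the next double quote.

```shell
#!/bin/sh
# outside_translation KEYWORD FLATFILE
# Prints lines containing KEYWORD (case-insensitive, fixed string),
# skipping every line that is part of a /translation="..." qualifier.
# Hypothetical helper for illustration; not part of XYLEM.
outside_translation() {
    awk -v kw="$1" '
        BEGIN { kw = tolower(kw) }
        /\/translation="/ {
            rest = $0
            sub(/.*\/translation="/, "", rest)
            intrans = (rest !~ /"/)   # still open if no closing quote yet
            next
        }
        intrans { if ($0 ~ /"/) intrans = 0; next }
        index(tolower($0), kw) > 0
    ' "$2"
}
```

Comparing grep -i -c -F cap file with outside_translation cap file | wc -l
on the same file shows how many of the hits come from protein translations
alone.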
(3) WhereToLook
Use this option to specify the database to be searched. In the
case of GenBank, only one division at a time may be searched.
User-created database subsets containing PIR (P) or GenBank (G)
entries may also be searched. User-created database subsets
must be in the .ano/.wrp/.ind form created by splitdb.
OUTPUT
The output filenames take the following form:
name~ex1.ex2
The 'name' part of the filename is either the keyword searched for,
if option 1 was chosen, or the name of the keyword file, if option 2
was chosen. 'ex1' indicates the database division that was searched. For
PIR and VecBase, ex1 is 'pir' and 'vec', respectively. For GenBank,
ex1 is as follows:
bct - bacterial
inv - invertebrate
mam - other mammalian
est - expressed sequence tag
phg - phage
pln - plant (includes fungi)
pri - primate
rna - structural RNAs
rod - rodent
syn - synthetic sequences
sts - sequence tagged sites
una - unannotated (new) sequences
vrl - viral
vrt - other vertebrate
'ex2' distinguishes the files containing the names of entries
containing keywords (.nam) and the files containing the lines found
in each entry (.fnd).
The .nam file can be used directly as a namefile for fetch, getloc,
or getob.
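As a quick illustration of the naming rule, the following sh function
builds the two default output names. It is a sketch, not code taken from
findkey: default_outputs is a made-up name, the '~' separator follows the
program output shown under INTERACTIVE USE, and dropping a trailing
extension such as .kw from the keyword file name is an assumption based
on the examples in this document.

```shell
#!/bin/sh
# default_outputs NAME DIVISION
# NAME is a keyword or a keyword file name; DIVISION is the ex1 code
# (pir, vec, pln, ...). Prints the .nam and .fnd names, one per line.
# Assumption: a trailing extension on NAME (e.g. .kw) is dropped.
default_outputs() {
    base=${1%.*}
    printf '%s~%s.nam\n%s~%s.fnd\n' "$base" "$2" "$base" "$2"
}
```

For instance, default_outputs thionin pir prints thionin~pir.nam and
thionin~pir.fnd, matching the interactive example.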
COMMAND LINE USE
OPTIONS
-p                  search PIR (default)
-P PIR_dataset      search PIR_dataset, containing PIR entries
-v                  search VecBase
-b                  search GenBank bacterial division
-m                  search GenBank mammalian division
-g                  search GenBank phage division
-r                  search GenBank primate division
-d                  search GenBank rodent division
-u                  search GenBank unannotated division
-t                  search GenBank vertebrate division
-i                  search GenBank invertebrate division
-l                  search GenBank plant division
-n                  search GenBank rna division
-s                  search GenBank synthetic division
-a                  search GenBank viral division
-x                  search GenBank patented division
-e                  search GenBank expressed sequence tag division
-z                  search GenBank STS division
-S                  search GenBank genome survey division
-h                  search GenBank high-throughput division
-G GenBank_dataset  search GenBank_dataset, containing GenBank entries
-L                  force execution of findkey on local host
                    even if $XYLEM_RHOST is set. See "REMOTE
                    EXECUTION" below
FILES
keywordfile - contains keywords to search for
namefile - LOCUS names of hits are written to this file
findfile - for each hit, a report listing the LOCUS name and the
lines matching the keyword is written to this file.
If namefile and findfile are not specified on the command line,
filenames will be created as described above for interactive
use.
PIR_dataset
GenBank_dataset
Each of these can be either a file of entries (PIR entries for -P,
GenBank entries for -G) or a XYLEM dataset created by splitdb. A file
of PIR entries must have the file extension ".pir". A file of GenBank
entries must have the file extension ".gen". A XYLEM dataset contains
the entries split among three files by splitdb: annotation (.ano),
sequence (.wrp) and index (.ind). These file extensions must be used!
When specifying a split dataset, only the base name needs to be
used. For example, given a XYLEM dataset consisting of the files
myset.ano, myset.wrp and myset.ind, the following two commands
are equivalent:
findkey -P myset something.kw
findkey -P myset.ano something.kw
If the original .pir file had been used, the command would have
been
findkey -P myset.pir something.kw
The ability to work directly with .gen or .pir files is quite
convenient. However, since FINDKEY needs to work with a split
dataset, it automatically splits .pir or .gen files into .ano, .wrp
and .ind files, which are removed when the search is finished. This
requires extra disk space and execution time, which could be
significant for large datasets.
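The rules for dataset arguments can be summarized in a short sh sketch.
The function names dataset_kind and split_complete are hypothetical and
the code is not taken from findkey; it only restates the extension rules
given above so that an argument can be checked before a long search.

```shell
#!/bin/sh
# dataset_kind ARG
# Prints "flat BASE" for a .pir/.gen file of entries (which would have
# to be split first), or "split BASE" for a splitdb dataset given by
# base name or by one of its three files.
dataset_kind() {
    case $1 in
        *.pir|*.gen)       echo "flat ${1%.*}" ;;
        *.ano|*.wrp|*.ind) echo "split ${1%.*}" ;;
        *)                 echo "split $1" ;;
    esac
}

# split_complete BASE
# Succeeds only if BASE.ano, BASE.wrp and BASE.ind all exist.
split_complete() {
    for ext in ano wrp ind; do
        [ -f "$1.$ext" ] || return 1
    done
}
```

With these definitions, dataset_kind myset and dataset_kind myset.ano both
print "split myset", while dataset_kind myset.pir prints "flat myset".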
EXAMPLES
If the list of antibiotics shown above was stored in the file
antibiotic.kw, and option 3 was set to 'b', then the annotation
portion of the GenBank bacterial division would be searched, and
all lines containing any of these keywords would be written to
antibiotic~bct.fnd. The corresponding GenBank entry names would
appear in antibiotic~bct.nam.
The same keyword file could be used to search other database files.
If VecBase was searched, the output files would be antibiotic~vec.fnd
and antibiotic~vec.nam. These filename conventions make it easy
to search different database divisions, and to keep track of where
data came from.
Command line examples:
findkey thionin.kw
would be equivalent to the interactive example shown above. In
this case, the file thionin.kw contains the word 'thionin'.
(Note that since PIR is the default, -p need not be supplied.)
findkey -b antibiotic.kw drugs.nam drugs.fnd
would search the GenBank bacterial division for the keywords
contained in antibiotic.kw, and write the output to drugs.nam
and drugs.fnd.
FILES
Database files:
The directories for database files are specified by the environment
variables $GB (GenBank), $PIR (PIR/NBRF) and $VEC (VecBase).
The annotation (.ano) and index (.ind) files are those generated by
splitdb.
Temporary files:
$jobid.fnd
$jobid.nam
$jobid.grep
where $jobid is a unique jobid generated by the shell
REMOTE EXECUTION
Where the databases can not be stored locally, FINDKEY can call
FINDKEY on another system and retrieve the results. To run
FINDKEY remotely, your .cshrc file should contain the following
lines:
setenv XYLEM_RHOST remotehostname
setenv XYLEM_USERID remoteuserid
where remotehostname is the name of the host on which the
databases reside (in XYLEM split format) and remoteuserid
is your userid on the remote system. When run remotely,
your local copy of FINDKEY will generate the following
commands:
rcp filename $XYLEM_USERID@$XYLEM_RHOST:filename
rsh $XYLEM_RHOST -l $XYLEM_USERID findkey ...
rcp $XYLEM_USERID@$XYLEM_RHOST:outputfilename outputfilename
rsh $XYLEM_RHOST -l $XYLEM_USERID rm temporary_files
Because FINDKEY uses rsh and rcp, your home directory on both
the local and remote systems must have a world-readable
file called .rhosts, containing the names of trusted remote
hosts and your userid on each host. Before trying to get
FINDKEY to work remotely, make sure that you can rcp and
rsh to the remote host.
Obviously, remote execution of FINDKEY implies that FINDKEY
must already be installed on the remote host. When FINDKEY
runs another copy of FINDKEY remotely, it uses the -L option
(findkey -L) to ensure that the remote FINDKEY job executes,
rather than calling yet another FINDKEY on another host.
---------- Remote execution on more than 1 host -----------
If more than 1 remote host is available for running FINDKEY
(say, in a clustered environment where many servers mount
a common filesystem) the choice of a host can be determined
by the csh script choosehost, such that execution of
choosehost returns the name of a remote server. To use this
approach, the following script, called 'choosehost' should
be in your bin directory:
#!/bin/csh
# choosehost - choose a host to use for a remote job.
# This script rotates among servers listed in .rexhosts,
# by choosing the host at the top of the list and moving
# it to the bottom.
# Rotate the list, moving the current host to the bottom.
set HOST = `head -n 1 $home/.rexhosts`
set JOBID = $$
# tail -n +2 prints everything from line 2 onward; the older form
# tail +2 is rejected by current tail implementations.
tail -n +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
echo $HOST >> /tmp/.rexhosts.$JOBID
mv /tmp/.rexhosts.$JOBID $home/.rexhosts
# Write out the current host name
echo $HOST
You must also have a file in your home directory called
.rexhosts, listing remote hosts, such as
graucho.cc.umanitoba.ca
harpo.cc.umanitoba.ca
chico.cc.umanitoba.ca
zeppo.cc.umanitoba.ca
Each time choosehost is called, choosehost will rotate the
names in the file. For example, starting with the .rexhosts
as shown, it will move graucho.cc.umanitoba.ca to the bottom
of the file, and write the line 'graucho.cc.umanitoba.ca'
to the standard output. The next time choosehost is
run, it would write 'harpo.cc.umanitoba.ca', and so on.
Depending on your local configuration, you may wish to
rewrite choosehost. All that is really necessary is that
echo `choosehost` should return the name of a valid host.
Once you have installed choosehost and tested it, you can
get FINDKEY to use choosehost simply by setting
setenv XYLEM_RHOST choosehost
in your .cshrc file.
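For testing the rotation without touching your real .rexhosts, the same
logic can be written as a POSIX sh function that takes the host list as
an argument. This is a sketch for experimentation only: rotate_hosts is
a made-up name, it is not a replacement for the csh script above, and it
assumes that the mktemp command is available.

```shell
#!/bin/sh
# rotate_hosts FILE
# Prints the first host in FILE and moves that line to the bottom,
# mirroring what the csh choosehost script does with .rexhosts.
rotate_hosts() {
    file=$1
    tmp=$(mktemp "${TMPDIR:-/tmp}/rexhosts.XXXXXX") || return 1
    host=$(head -n 1 "$file")
    tail -n +2 "$file" > "$tmp"       # everything except the first line
    printf '%s\n' "$host" >> "$tmp"   # first line goes to the bottom
    mv "$tmp" "$file"
    printf '%s\n' "$host"
}
```

Calling the function once per line of a copy of the host list should
return each host exactly once and leave the file in its original order.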
--------------- Remote filesystems -----------------------
Finally, an alternative to remote execution is to remotely mount
the file system containing the databases across the network.
This has the advantage of simplicity, and means that the
databases are available for ALL programs on your local
workstation. However, it may still be advantageous to run
XYLEM remotely, since that will shift much of the computational
load to another host.
BUGS
At present, regular expression characters cannot be used for
keyword searches.
SEE ALSO
grep(1V) identify splitdb
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.