365 lines
16 KiB
Text
365 lines
16 KiB
Text
|
|
FINDKEY.DOC update 13 Mar 97
|
|
|
|
|
|
NAME
|
|
findkey - finds database entries containg one or more keywords
|
|
|
|
SYNOPSIS
|
|
findkey
|
|
findkey [-pvbmgrdutielnsaxzL] keywordfile [namefile findfile]
|
|
findkey [-P PIR_dataset] keywordfile [namefile findfile]
|
|
findkey [-G GenBank_dataset] keywordfile [namefile findfile]
|
|
|
|
DESCRIPTION
|
|
findkey uses the grep family of commands to find lines in database
|
|
annotation files containing one or more keywords. Next, identify
|
|
is called to create a .nam file, containing the names of entries
|
|
containing the keywords, and a .fnd file, containing the actual
|
|
lines from each entry containing hits. A PIR or GenBank dataset is
|
|
either a file containing one or more GenBank or PIR entries, or
|
|
the name of a XYLEM dataset created by splitdb. See FILES below
|
|
for a more detailed description.
|
|
|
|
INTERACTIVE USE
|
|
findkey prompts the user to set search parameters, using an interactive
|
|
menu:
|
|
|
|
___________________________________________________________________
|
|
FINDKEY - Version 12 Aug 94
|
|
Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
|
|
___________________________________________________________________
|
|
Keyfile:
|
|
Dataset:
|
|
-------------------------------------------------------------------
|
|
Parameter Description Value
|
|
-------------------------------------------------------------------
|
|
1) Keyword Keyword to find thionin
|
|
2) Keyfile Get list of keywords from Keyfile
|
|
3) WhereToLook p:PIR v:VecBase p
|
|
GenBank - b:bacterial i:invertebrate
|
|
m:mamalian e:expressed seq. tag
|
|
g:phage l:plant
|
|
r:primate n:rna
|
|
d:rodent s:synthetic
|
|
u:unannotated a:viral
|
|
t:vertebrate x:patented
|
|
z:STS
|
|
G: GenBank dataset P: PIR dataset
|
|
-------------------------------------------------------------
|
|
Type number of your choice or 0 to continue:
|
|
0
|
|
Searching /home/psgendb/PIR/pir1.ano...
|
|
Sequence names will be written to thionin~pir.nam
|
|
Lines containing keyword(s) will be written to thionin~pir.fnd
|
|
Searching /home/psgendb/PIR/pir2.ano...
|
|
Sequence names will be written to thionin~pir.nam
|
|
Lines containing keyword(s) will be written to thionin~pir.fnd
|
|
Searching /home/psgendb/PIR/pir3.ano...
|
|
Sequence names will be written to thionin~pir.nam
|
|
Lines containing keyword(s) will be written to thionin~pir.fnd
|
|
|
|
As shown in the example above, the keyword thionin was specified
|
|
as the keyword to search for. By default, option 3 is set to p,
|
|
and the PIR protein database is searched. Messages describe the
|
|
progress of the search. Since PIR is broken up into two divisions
|
|
(new and protein) both are searched, but all output is written to
|
|
thionin.pir.nam and thionin.pir.fnd
|
|
|
|
OPTIONS
|
|
(1,2) Which keywords to search for?
|
|
If you want to search for a single keyword, option 1 lets you type
|
|
the keyword, without having to create a file. To search for more
|
|
than one keyword, choose option 2, and specify the name of a
|
|
file containing the keywords. For example, entries containing
|
|
genes for antibiotic resistance might be found using the
|
|
following keyword file:
|
|
|
|
ampicillin
|
|
chloramphenicol
|
|
kanamycin
|
|
neomycin
|
|
tetracycline
|
|
|
|
Note: keyword searches are case insensitive.
|
|
|
|
As you might expect, it takes longer to search for multiple
|
|
keywords than a single keyword.
|
|
|
|
Options 1 & 2 are mutually exclusive. Setting one will negate the
|
|
other. If option 2 is chosen, the name of the keyword file will
|
|
appear at the top of the menu.
|
|
|
|
Finally, it is probably not a good idea to search GenBank
|
|
entries using very short keywords consisting only of letters.
|
|
This is because GenBank entries now include a /translation
|
|
field containing the amino acid sequence of each protein
|
|
coding sequence. Consequently, 3 or 4 letter keywords
|
|
consisting of legal amino acid symbols (eg. CAP, recA) will
|
|
turn up fairly often in protein translations.
|
|
|
|
(3) WhereToLook
|
|
Use this option to specify the database to be searched In the
|
|
case of GenBank, only one division at a time may be searched.
|
|
User-created database subsets containing PIR (P) or GenBank (G)
|
|
entries may also be searched. User-created database subsets
|
|
must be in the .ano/.wrp/.ind form created by splitdb.
|
|
|
|
OUTPUT
|
|
The output filenames take the following form:
|
|
|
|
name_ex1.ex2
|
|
|
|
The 'name' part of the filename is either the keyword searched for,
|
|
if option 1 was chosen, or the name of the keyword file,if option 2
|
|
obtains. 'ex1' indicates the database division that was searched. For
|
|
PIR and VecBase, ex1 is 'pir' and 'vec', respectively. For GenBank,
|
|
ex1 is as follows:
|
|
|
|
bct - bacterial
|
|
inv - invertebrate
|
|
mam - other mamalian
|
|
est - expressed sequence tag
|
|
phg - phage
|
|
pln - plant (includes fungi)
|
|
pri - primate
|
|
rna - structural RNAs
|
|
rod - rodent
|
|
syn - synthetic sequences
|
|
sts - sequence tagged sites
|
|
una - unannotated (new) sequences
|
|
vrl - viral
|
|
vrt - other vertebrate
|
|
|
|
'ex2' distinguishes the files containing the names of entries
|
|
containing keywords (.nam) and the files containing the lines found
|
|
in each entry (.fnd).
|
|
|
|
The .nam file can be used directly as a namefile for fetch, getloc,
|
|
or getob.
|
|
|
|
COMMAND LINE USE
|
|
|
|
OPTIONS
|
|
p search PIR (default)
|
|
P PIR dataset search dbfile, containing PIR entries
|
|
v search VecBase
|
|
b search Genbank bacterial division
|
|
m search Genbank mamalian division
|
|
g search Genbank phage division
|
|
r search Genbank primate division
|
|
d search Genbank rodent division
|
|
u search Genbank unannotated division
|
|
t search Genbank vertebrate division
|
|
i search Genbank invertebrate division
|
|
l search Genbank plant division
|
|
n search Genbank rna division
|
|
s search Genbank synthetic division
|
|
a search Genbank viral division
|
|
x search Genbank patented division
|
|
e search Genbank exp.seq.tag division
|
|
z search GenBank STS division
|
|
S search GenBank Genom. Survey division
|
|
h search GenBank High Thrput. division
|
|
G GenBank dataset search dbfile, containing GenBank entries
|
|
|
|
L force execution of findkey on local host
|
|
even if $XYLEM_RHOST is set. See "REMOTE
|
|
EXECUTION" below
|
|
|
|
FILES
|
|
|
|
keywordfile - contains keywords to search for
|
|
|
|
namefile - LOCUS names of hits are written to this file
|
|
|
|
findfile - for each hit, a report listing the LOCUS name and the
|
|
lines matching the keyword if written to this file.
|
|
|
|
If namefile and findfile are not specified on the command line,
|
|
filenames will be created as described above for interactive
|
|
use.
|
|
|
|
PIR_dataset
|
|
GenBank_dataset
|
|
This can be either a file of PIR entries, a file of GenBank entries,
|
|
or a XYLEM dataset created by splitdb. A file of PIR entries must
|
|
have the file extension ".pir". A file of GenBank entries must have
|
|
the file extension ".gen". A XYLEM dataset contains PIR entries split
|
|
among three files by splitdb: annotation (.ano), sequence (.wrp)
|
|
and index (.ind). These file extensions must be used!
|
|
|
|
When specifying a split dataset, only the base name needs to be
|
|
used. For example given a XYLEM dataset consisting of the files
|
|
myset.ano, myset.wrp and myset.ind, the following two commands
|
|
are equivalent:
|
|
|
|
findkey -P myset something.kw
|
|
findkey -P myset.ano something.kw
|
|
|
|
If the original .pir file had been used, the command would have
|
|
been
|
|
|
|
findkey -P myset.pir something.kw
|
|
|
|
The ability to work directly with .gen or .pir files is quite
|
|
convenient. However, since FINDKEY needs to work with a split
|
|
FINDKEY automatically splits .pir or .gen files into .ano, .wrp
|
|
and .ind files, which are removed when finished. This requires
|
|
extra disk space and execution time, which could be significant
|
|
for large datasets.
|
|
|
|
EXAMPLES
|
|
If the list of antibiotics shown above was stored in the file
|
|
antibiotic.kw, and option 3 was set to 'b', then the annotation
|
|
portion of the GenBank bacterial division would be searched, and
|
|
all lines containing any of these keywords would be written to
|
|
antibiotic~bac.fnd. The corresponding GenBank entry names would
|
|
appear in antibiotic~bac.nam.
|
|
|
|
The same keyword file could be used to search other database files.
|
|
If VecBase was searched, the output files would be antibiotic~vec.fnd
|
|
and antibiotic~vec.nam. These filename conventions make it easy
|
|
to search different database divisions, and to keep track of where
|
|
data came from.
|
|
|
|
Command line examples:
|
|
|
|
findkey thionin.kw
|
|
|
|
would be equivalent to the interactive example shown above. In
|
|
this case, the file thionin.kw contains the word 'thionin'.
|
|
(Note that since PIR is the default, -p need not be supplied.)
|
|
|
|
findkey -b antibiotic.kw drugs.nam drugs.fnd
|
|
|
|
would search the GenBank bacterial division for the keywords
|
|
contained in antibiotic.kw, and write the output to drugs.nam
|
|
and drugs.kw.
|
|
|
|
FILES
|
|
Database files:
|
|
The directories for database files are specified by the environment
|
|
variables $GB (GenBank) $PIR (PIR/NBRF) and $VEC(Vecbase).
|
|
Annotation (.ano) and index (.ind) are those generated by splitdb.
|
|
|
|
Temporary files:
|
|
$jobid.fnd
|
|
$jobid.nam
|
|
$jobid.grep
|
|
|
|
where $jobid is a unique jobid generated by the shell
|
|
|
|
REMOTE EXECUTION
|
|
Where the databases can not be stored locally, FINDKEY can call
|
|
FINDKEY on another system and retrieve the results. To run
|
|
FINDKEY remotely, your .cshrc file should contain the following
|
|
lines:
|
|
|
|
setenv XYLEM_RHOST remotehostname
|
|
setenv XYLEM_USERID remoteuserid
|
|
|
|
where remotehostname is the name of the host on which the
|
|
databases reside (in XYLEM split format) and remoteuserid
|
|
is your userid on the remote system. When run remotely,
|
|
your local copy of FINDKEY will generate the following
|
|
commands:
|
|
|
|
rcp filename $XYLEM_USERID@$XYLEM_HOST:filename
|
|
rsh $XYLEM_RHOST -l $XYLEM_USERID findkey ...
|
|
rcp $XYLEM_USERID@$XYLEM_HOST:outputfilename outputfilename
|
|
rsh $XYLEM_RHOST -l $XYLEM_USERID rm temporary_files
|
|
|
|
Because FINDKEY uses rsh and rcp, your home directory on both
|
|
the local and remote systems must have a world-readable
|
|
file called .rhosts, containing the names of trusted remote
|
|
hosts and your userid on each host. Before trying to get
|
|
FINDKEY to work remotely, make sure that you can rcp and
|
|
rsh to the remote host.
|
|
|
|
Obviously, remote execution of FINDKEY implies that FINDKEY
|
|
must already be installed on the remote host. When FINDKEY
|
|
runs another copy of FINDKEY remotely, it uses the -L option
|
|
(findkey -L) to insure that the remote FINDKEY job executes,
|
|
rather than calling yet another FINDKEY on another host.
|
|
|
|
---------- Remote execution on more than 1 host -----------
|
|
If more than 1 remote host is available for running FINDKEY
|
|
(say, in a clustered environment where many servers mount
|
|
a common filesystem) the choice of a host can be determined
|
|
by the csh script choosehost, such that execution of
|
|
choosehost returns the name of a remote server. To use this
|
|
approach, the following script, called 'choosehost' should
|
|
be in your bin directory:
|
|
|
|
#!/bin/csh
|
|
# choosehost - choose a host to use for a remote job.
|
|
# This script rotates among servers listed in .rexhosts,
|
|
# by choosing the host at the top of the list and moving
|
|
# it to the bottom.
|
|
|
|
#Rotate the list, putting the current host to the bottom.
|
|
set HOST = `head -1 $home/.rexhosts`
|
|
set JOBID = $$
|
|
tail +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
|
|
echo $HOST >> /tmp/.rexhosts.$JOBID
|
|
/usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts
|
|
|
|
# Write out the current host name
|
|
echo $HOST
|
|
|
|
You must also have a file in your home directory called
|
|
.rexhosts, listing remote hosts, such as
|
|
|
|
graucho.cc.umanitoba.ca
|
|
harpo.cc.umanitoba.ca
|
|
chico.cc.umanitoba.ca
|
|
zeppo.cc.umanitoba.ca
|
|
|
|
Each time choosehost is called, choosehost will rotate the
|
|
names in the file. For example, starting with the .rexhosts
|
|
as shown, it will move graucho.cc.umanitoba.ca to the bottom
|
|
of the file, and write the line 'graucho.cc.umanitoba.ca'
|
|
to the standard output. The next time choosehosts is
|
|
run, it would write 'harpo.cc.umanitoba.ca', and so on.
|
|
|
|
Depending on your local configuration, you may wish to
|
|
rewrite choosehosts. All that is really necessary is that
|
|
echo `choosehost` should return the name of a valid host.
|
|
|
|
Once you have installed choosehost and tested it, you can
|
|
get FINDKEY to use choosehost simply by setting
|
|
|
|
setenv XYLEM_RHOST choosehost
|
|
|
|
in your .cshrc file.
|
|
|
|
--------------- Remote filesystems -----------------------
|
|
Finally, an alternative to remote execution is to remotely mount
|
|
the file system containing the databases across the network.
|
|
This has the advantage of simplicity, and means that the
|
|
databases are available for ALL programs on your local
|
|
workstation. However, it may still be advantageous to run
|
|
XYLEM remotely, since that will shift much of the computational
|
|
load to another host.
|
|
|
|
|
|
BUGS
|
|
At present, regular expression characters cannot be used for
|
|
keyword searches.
|
|
|
|
SEE ALSO
|
|
grep(1V) identify splitdb
|
|
|
|
AUTHOR
|
|
Dr. Brian Fristensky
|
|
Dept. of Plant Science
|
|
University of Manitoba
|
|
Winnipeg, MB Canada R3T 2N2
|
|
Phone: 204-474-6085
|
|
FAX: 204-261-5732
|
|
frist@cc.umanitoba.ca
|
|
|
|
REFERENCE
|
|
Fristensky, B. (1993) Feature expressions: creating and manipulating
|
|
sequence datasets. Nucleic Acids Research 21:5997-6003.
|