FETCH.DOC update 24 Feb 96
NAME
fetch - retrieves database entries by name or accession number
SYNOPSIS
fetch {interactive mode}
fetch [options] namefile [output file] {batch mode}
DESCRIPTION
fetch retrieves one or more entries from a database.
Interactive mode: fetch prompts the user to set search parameters,
using an interactive menu:
___________________________________________________________________
FETCH - Version 7 Feb 94
Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
___________________________________________________________________
Namefile:
Outfile:
Database:
-------------------------------------------------------------------
Parameter      Description                                   Value
1) Name/Acc    Name or Accession sequence to get
2) Namefile    Get list of sequences from Namefile
3) WhatToGet   a:annotation s:sequence b:both                b
4) Database    g:GenBank p:PIR v:VecBase l:LiMB              g
               G:GenBank dataset P:PIR dataset
5) Outfile     Send all output to a single file (Outfile)
6) Files       f:Send each entry to a separate file          f
-------------------------------------------------------------
Type number of your choice or 0 to continue:
After all parameters have been set, type 0 to commence the search.
Messages regarding the progress of the search will be printed.
(1,2) Which entries to get?
If you want to get a single entry, option 1 lets you type in the
name of that entry, without having to create a namefile. To get
more than one entry, choose option 2, and specify the name of a
file containing sequence names or accession numbers.
namefile is a file containing one or more sequence names or
accession numbers, each on a separate line. Names and accession
numbers can even be interspersed, in upper or lowercase, and in
any order. For example, the namefile prp.nam might contain
; plant pathogenesis related proteins
; (these are sample comment lines)
; note that any line containing a semicolon is ignored
x06362
x05454
TOBPR1A1
; comments can be interspersed with names.
PUMPR13
tobpr1ar
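The namefile rules above can be sketched in a few lines of Python. This is only an illustration, not the code FETCH itself uses: the function name read_names is hypothetical, and folding names to uppercase is an assumption based on the statement that case does not matter.

```python
def read_names(lines):
    """Return the names/accession numbers found in namefile lines."""
    names = []
    for line in lines:
        # Any line containing a semicolon is treated as a comment.
        if ";" in line:
            continue
        name = line.strip()
        if name:
            # Assumption: case is not significant, so fold to uppercase.
            names.append(name.upper())
    return names


sample = [
    "; plant pathogenesis related proteins",
    "x06362",
    "TOBPR1A1",
    "; comments can be interspersed with names.",
    "tobpr1ar",
]
print(read_names(sample))
```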
Options 1 & 2 are mutually exclusive. Setting one will negate the
other. If option 2 is chosen, the name of the namefile will appear
at the top of the menu.
(3) WhatToGet
Use this option to specify whether to get annotation, sequence,
or both (default=both).
(4) Database
Use this option to select the database (default=GenBank).
G and P select user-created database subsets containing GenBank
or PIR entries, respectively. It is assumed that the database
has been split into .ano, .wrp and .ind files using splitdb.
For example, if you had created a database subset called PR1.pir,
splitdb would create PR1.ano, PR1.wrp and PR1.ind. These are
the files actually read by FETCH. When prompted for the name
of the database, simply type "PR1", without a file extension.
(If you do type a file extension, it will be ignored).
(5, 6) Where to send output
By default, option 6 is set to f, and each entry will be written to
a separate file, where the name of the file is the name of the
entry, followed by a file extension. If a complete entry is
retrieved, the file extension will indicate the type of database
(GenBank: .gen; PIR: .pir; VecBase: .vec; LiMB: .LiMB). If only
annotation or sequence is retrieved, the file extension will be
.ano or .wrp, respectively. Using the default, the namefile above
would create the following files:
PUMPR13.gen
TOBPR1A1.gen
TOBPR1AR.gen
TOBPR1CR.gen
TOBPR1PS.gen
By choosing option 5, you can specify the name of an output file
for all entries to go to. This filename will appear at the top
of the menu. Obviously, options 5 & 6 are mutually exclusive.
Note that retrieved entries are written in alphabetical order (sorted
by ASCII value), not in the order in which they appear in the namefile.
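The naming and ordering rules for option 6 can be illustrated with a short Python sketch. The function output_files and its arguments are hypothetical. The extensions are the ones listed above, and the one-letter codes follow menu options 3 and 4.

```python
# Extensions for complete entries, keyed by the database letter of option 4.
FULL_EXT = {"g": ".gen", "p": ".pir", "v": ".vec", "l": ".LiMB"}
# Extensions when only annotation (a) or sequence (s) is retrieved.
PART_EXT = {"a": ".ano", "s": ".wrp"}


def output_files(entry_names, database="g", what="b"):
    """Return one file name per entry, in ASCII (alphabetical) order."""
    ext = FULL_EXT[database] if what == "b" else PART_EXT[what]
    # sorted() compares strings by character code, i.e. ASCII order here.
    return [name + ext for name in sorted(entry_names)]


print(output_files(["TOBPR1A1", "PUMPR13", "TOBPR1AR"]))
print(output_files(["TOBPR1A1", "PUMPR13"], what="s"))
```

Note that the input order of the names has no effect on the result, which mirrors the behavior described for the namefile.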
(Note for remote users only: -f works only for a single
name/accession supplied in option 1. -f IS NOT ENABLED FOR NAMEFILES
specified in option 2.)
Batch mode:
Although it is transparent to the user, all fetch really does
is call getloc, saving the user the trouble of knowing which
database files to retrieve sequences from, or of having to
execute getloc multiple times to retrieve sequences from
different database files. Thus, the options are identical to those
for getloc:
-a Write annotation portions of entries only, terminated by '//'.
-s Write sequence data only, in Pearson (.wrp) format.
-f Write each entry to a separate file.
-g GenBank (default)
-e EMBL {not implemented}
-p PIR (NBRF)
-v Vecbase
-l LiMB
-G GenBank_dataset
-P PIR_dataset
If -f is not specified, outfile must be specified.
-L force execution of fetch on local host even if
$XYLEM_RHOST is set. See "REMOTE EXECUTION" below.
PIR_dataset
GenBank_dataset
This can be either a file of PIR entries, a file of GenBank entries,
or a XYLEM dataset created by splitdb. A file of PIR entries must
have the file extension ".pir". A file of GenBank entries must have
the file extension ".gen". A XYLEM dataset contains PIR or GenBank
entries split among three files by splitdb: annotation (.ano),
sequence (.wrp) and index (.ind). These file extensions must be used!
When specifying a split dataset, only the base name needs to be
used. For example, given a XYLEM dataset consisting of the files
myset.ano, myset.wrp and myset.ind, the following two commands
are equivalent:
fetch -P myset something.nam something.pir
fetch -P myset.ano something.nam something.pir
If the original .pir file had been used, the command would have
been
fetch -P myset.pir something.nam something.pir
The ability to work directly with .gen or .pir files is quite
convenient. However, since FETCH needs to work with a split
dataset, it automatically splits .pir or .gen files into .ano, .wrp
and .ind files, which are removed when finished. This requires
extra disk space and execution time, which could be significant
for large datasets.
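The way a dataset argument is interpreted can be sketched as follows. This is a minimal illustration in Python, not FETCH's own logic: the function resolve_dataset is hypothetical, and it only reports the base name and whether a temporary split would be needed.

```python
import os


def resolve_dataset(name):
    """Return (base name, needs_split) for a -P or -G dataset argument."""
    base, ext = os.path.splitext(name)
    # An unsplit .pir or .gen file has to be split first; a base name or
    # any of the split files (.ano, .wrp, .ind) refers to an existing dataset.
    needs_split = ext in (".pir", ".gen")
    return base, needs_split


for arg in ("myset", "myset.ano", "myset.pir"):
    print(arg, resolve_dataset(arg))
```

Under these assumptions, "myset" and "myset.ano" resolve identically, while "myset.pir" is flagged as needing the extra split step and its associated disk space and time.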
EXAMPLES
Batch example:
fetch -f chitinase.nam
will retrieve annotation and sequence for sequences listed in
chitinase.nam from GenBank, writing each entry to a separate file
with the extension .gen.
fetch -s -v pbr.nam pbr.wrp
will retrieve sequence data only for the entries listed in pbr.nam,
from VecBase, and write all sequences to a Pearson format file
(i.e., readable by fasta) with the name pbr.wrp.
fetch -G sample sample.nam new.gen
fetch -G sample.ano sample.nam new.gen
Assumes that a set of GenBank entries has been split by splitdb
into sample.ano sample.wrp and sample.ind. The entries listed in
sample.nam are written to new.gen.
FILES
Database files:
The directories for database files are specified by the environment
variables $GB (GenBank), $PIR (PIR/NBRF), $VEC (VecBase) and $LIMB
(LiMB).
Index files are $GB/gbacc.idx for GenBank (this file is supplied
with each GenBank release), while the other databases
use .ind files generated by splitdb. Split database files MUST
have the following file extensions: .ano {annotation}, .wrp
{sequence} and .ind {index}. Thus, when creating database files
for pir1.dat with splitdb, the output files should be pir1.ano,
pir1.wrp and pir1.ind.
Temporary files:
NAMEFILE.fetch
PRELIMINARY.fetch
TMP.fetch
FOUND.fetch
FETCHDIR {temporary directory}
REMOTE EXECUTION
Where the databases cannot be stored locally, FETCH can call
FETCH on another system and retrieve the results. To run
FETCH remotely, your .cshrc file should contain the following
lines:
setenv XYLEM_RHOST remotehostname
setenv XYLEM_USERID remoteuserid
where remotehostname is the name of the host on which the
databases reside (in XYLEM split format) and remoteuserid
is your userid on the remote system. When run remotely,
your local copy of FETCH will generate the following
commands:
rcp filename $XYLEM_USERID@$XYLEM_RHOST:filename
rsh $XYLEM_RHOST -l $XYLEM_USERID fetch ...
rcp $XYLEM_USERID@$XYLEM_RHOST:outputfilename outputfilename
rsh $XYLEM_RHOST -l $XYLEM_USERID $RM temporary_files
Because FETCH uses rsh and rcp, your home directory on both
the local and remote systems must have a world-readable
file called .rhosts, containing the names of trusted remote
hosts and your userid on each host. Before trying to get
FETCH to work remotely, make sure that you can rcp and
rsh to the remote host.
Obviously, remote execution of FETCH implies that FETCH
must already be installed on the remote host. When FETCH
runs another copy of FETCH remotely, it uses the -L option
(fetch -L) to ensure that the remote FETCH job executes locally,
rather than calling yet another FETCH on another host.
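The remote sequence described above (copy the namefile over, run fetch -L remotely, copy the result back, clean up) can be sketched in Python as a function that only builds the command strings. This is an assumption-laden illustration: the function remote_commands is hypothetical, the exact arguments FETCH passes are not documented here, and the namefile and output file stand in for the unspecified temporary files.

```python
def remote_commands(rhost, userid, namefile, outfile, options=""):
    """Build (but do not run) the four commands for one remote fetch."""
    remote = f"{userid}@{rhost}"
    # -L keeps the remote fetch from calling yet another remote host.
    fetch_cmd = " ".join(
        part for part in ("fetch", "-L", options, namefile, outfile) if part
    )
    return [
        f"rcp {namefile} {remote}:{namefile}",
        f"rsh {rhost} -l {userid} {fetch_cmd}",
        f"rcp {remote}:{outfile} {outfile}",
        # Assumption: the remote copies are the temporary files to remove.
        f"rsh {rhost} -l {userid} rm {namefile} {outfile}",
    ]


for cmd in remote_commands("remotehostname", "remoteuserid",
                           "prp.nam", "prp.gen", "-g"):
    print(cmd)
```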
---------- Remote execution on more than 1 host -----------
If more than 1 remote host is available for running FETCH
(say, in a clustered environment where many servers mount
a common filesystem) the choice of a host can be determined
by the csh script choosehost, such that execution of
choosehost returns the name of a remote server. To use this
approach, the following script, called 'choosehost' should
be in your bin directory:
#!/bin/csh
# choosehost - choose a host to use for a remote job.
# This script rotates among servers listed in .rexhosts,
# by choosing the host at the top of the list and moving
# it to the bottom.
# Rotate the list, putting the current host at the bottom.
set HOST = `head -1 $home/.rexhosts`
set JOBID = $$
# 'tail -n +2' prints everything from line 2 on (older systems: 'tail +2')
tail -n +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
echo $HOST >> /tmp/.rexhosts.$JOBID
/usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts
# Write out the current host name
echo $HOST
You must also have a file in your home directory called
.rexhosts, listing remote hosts, such as
graucho.cc.umanitoba.ca
harpo.cc.umanitoba.ca
chico.cc.umanitoba.ca
zeppo.cc.umanitoba.ca
Each time choosehost is called, choosehost will rotate the
names in the file. For example, starting with the .rexhosts
as shown, it will move graucho.cc.umanitoba.ca to the bottom
of the file, and write the line 'graucho.cc.umanitoba.ca'
to the standard output. The next time choosehost is
run, it would write 'harpo.cc.umanitoba.ca', and so on.
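The rotation performed by choosehost can also be shown on an in-memory list. The following Python sketch is only an illustration of the round-robin idea. The function choose_host is hypothetical and works on a list rather than on the .rexhosts file.

```python
def choose_host(hosts):
    """Return (chosen host, rotated list) for a non-empty host list."""
    chosen = hosts[0]
    # Move the chosen host from the top of the list to the bottom.
    return chosen, hosts[1:] + [chosen]


hosts = [
    "graucho.cc.umanitoba.ca",
    "harpo.cc.umanitoba.ca",
    "chico.cc.umanitoba.ca",
    "zeppo.cc.umanitoba.ca",
]
first, hosts_after_first = choose_host(hosts)
second, hosts_after_second = choose_host(hosts_after_first)
print(first)
print(second)
```

After four calls the list is back in its starting order, so each host receives every fourth job.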
Depending on your local configuration, you may wish to
rewrite choosehost. All that is really necessary is that
echo `choosehost` should return the name of a valid host.
Once you have installed choosehost and tested it, you can
get FETCH to use choosehost simply by setting
setenv XYLEM_RHOST choosehost
in your .cshrc file.
--------------- Remote filesystems -----------------------
Finally, an alternative to remote execution is to remotely mount
the file system containing the databases across the network.
This has the advantage of simplicity, and means that the
databases are available for ALL programs on your local
workstation. However, it may still be advantageous to run
FETCH remotely, since that will shift much of the computational
load to another host.
BUGS
When retrieving entries directly from GenBank, FETCH uses the
Accession Number index file gbacc.idx. In this case, FETCH
can retrieve all entries containing a given accession number.
This capability makes it possible to retrieve an entry using a
secondary accession number. However, if more than one entry
shares a secondary accession number, all of those entries will
be retrieved. While this behavior might be a bit of an
annoyance at times, it can also be useful because it alerts
the user to the presence of other, related entries that might
be of interest.
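The behavior described above can be illustrated with a toy lookup in Python. Everything here is hypothetical: the entry names and accession numbers are invented for illustration, and the dictionary is not the format of gbacc.idx.

```python
# Invented toy data: entry name -> accession numbers, primary first.
TOY_INDEX = {
    "ENTRYA": ["A00001"],
    "ENTRYB": ["A00002", "A00001"],  # A00001 is a secondary accession here
    "ENTRYC": ["A00003"],
}


def entries_for_accession(index, accession):
    """Return every entry that lists the accession, primary or secondary."""
    wanted = accession.upper()
    return sorted(name for name, accs in index.items() if wanted in accs)


print(entries_for_accession(TOY_INDEX, "a00001"))
print(entries_for_accession(TOY_INDEX, "A00003"))
```

In this toy case a request for A00001 returns two entries, because the second entry carries it as a secondary accession number.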
SEE ALSO
getloc features
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.