FETCH.DOC                                          update 24 Feb 96


NAME
fetch - retrieves database entries by name or accession number

SYNOPSIS
fetch                                     {interactive mode}
fetch [options] namefile [output file]    {batch mode}

DESCRIPTION
fetch retrieves one or more entries from a database.

Interactive mode: fetch prompts the user to set search parameters,
using an interactive menu:

___________________________________________________________________
                     FETCH - Version 7 Feb 94
    Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
___________________________________________________________________
Namefile:
Outfile:
Database:
-------------------------------------------------------------------
   Parameter    Description                                   Value

1) Name/Acc     Name or Accession sequence to get
2) Namefile     Get list of sequences from Namefile
3) WhatToGet    a:annotation  s:sequence  b:both              b
4) Database     g:GenBank  p:PIR  v:VecBase  l:LiMB           g
                G:GenBank dataset  P:PIR dataset
5) Outfile      Send all output to a single file (Outfile)
6) Files        f:Send each entry to a separate file          f
-------------------------------------------------------------
Type number of your choice or 0 to continue:

After all parameters have been set, type 0 to commence the search.
Messages regarding the progress of the search will be printed.

(1,2) Which entries to get?
If you want to get a single entry, option 1 lets you type in the
name of that entry, without having to create a namefile. To get
more than one entry, choose option 2, and specify the name of a
file containing sequence names or accession numbers.

namefile is a file containing one or more sequence names or
accession numbers, each on a separate line. Names and accession
numbers can be mixed, in uppercase or lowercase, and in any
order. For example, the namefile prp.nam might contain:

    ; plant pathogenesis related proteins
    ; (these are sample comment lines)
    ; note that any line containing a semicolon is ignored
    x06362
    x05454
    TOBPR1A1
    ; comments can be interspersed with names.
    PUMPR13
    tobpr1ar

Options 1 & 2 are mutually exclusive. Setting one will negate the
other. If option 2 is chosen, the name of the namefile will appear
at the top of the menu.

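The namefile rules above (one name or accession number per line, any
line containing a semicolon ignored, case not significant) can be
checked before running fetch with a small filter. This is only a sketch
in POSIX sh: the function name clean_namefile is made up for this
illustration, and folding names to uppercase is a presentation choice
assumed here, not something fetch requires.

```sh
# clean_namefile FILE   (hypothetical helper, not part of XYLEM)
# Print the names and accession numbers in FILE, one per line:
#   - drop every line containing a semicolon (comment lines)
#   - strip spaces and tabs, then drop empty lines
#   - fold to uppercase so that duplicates are easy to spot
clean_namefile() {
    grep -v ';' "$1" | tr -d ' \t' | grep -v '^$' | tr '[:lower:]' '[:upper:]'
}
```

For example, "clean_namefile prp.nam | sort -u" lists each distinct
name in prp.nam once.
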
(3) WhatToGet
Use this option to specify whether to get annotation, sequence,
or both (default=both).

(4) Database
Use this option to select the database (default=GenBank).
G and P select user-created database subsets containing GenBank
or PIR entries, respectively. It is assumed that the database
has been split into .ano, .wrp and .ind files using splitdb.
For example, if you had created a database subset called PR1.pir,
splitdb would create PR1.ano, PR1.wrp and PR1.ind. These are
the files actually read by FETCH. When prompted for the name
of the database, simply type "PR1", without a file extension.
(If you do type a file extension, it will be ignored.)

(5, 6) Where to send output
By default, option 6 is set to f, and each entry will be written to
a separate file, where the name of the file is the name of the
entry, followed by a file extension. If a complete entry is
retrieved, the file extension will indicate the type of database
(GenBank: .gen; PIR: .pir; VecBase: .vec; LiMB: .LiMB). If only
annotation or only sequence is retrieved, the file extension will be
.ano or .wrp, respectively. Using the default, the namefile above
would create the following files:

    PUMPR13.gen
    TOBPR1A1.gen
    TOBPR1AR.gen
    TOBPR1CR.gen
    TOBPR1PS.gen

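The extension rules above can be summarized as a small lookup. The
following POSIX sh sketch uses the menu letters for the database and
for WhatToGet; the function name fetch_ext is hypothetical, and the
user-created datasets (G and P) are left out because the text above
does not state which extension they receive.

```sh
# fetch_ext DB WHAT   (hypothetical helper, not part of XYLEM)
# DB:   g (GenBank), p (PIR), v (VecBase), l (LiMB)
# WHAT: a (annotation only), s (sequence only), b (both)
# Prints the file extension described in the text above.
fetch_ext() {
    case "$2" in
        a) echo ".ano" ;;
        s) echo ".wrp" ;;
        b)
            case "$1" in
                g) echo ".gen" ;;
                p) echo ".pir" ;;
                v) echo ".vec" ;;
                l) echo ".LiMB" ;;
                *) echo "unknown database: $1" >&2; return 1 ;;
            esac ;;
        *) echo "unknown WhatToGet: $2" >&2; return 1 ;;
    esac
}
```

With the defaults (g and b), an entry named TOBPR1A1 would therefore
be expected in "TOBPR1A1$(fetch_ext g b)".
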
By choosing option 5, you can specify the name of a single output
file to receive all entries. This filename will appear at the top
of the menu. Options 5 & 6 are mutually exclusive.
Note that retrieved entries are written in alphabetical order (sorted
by ASCII value), not in the order in which they appeared in the
namefile.

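To predict the order in which entries will appear in a single output
file, a list of names can be sorted by ASCII value. The sketch below
assumes a POSIX sort; ascii_sort is a name invented for this example.
Setting LC_ALL=C makes sort compare bytes, so uppercase letters come
before lowercase ones regardless of the user's locale.

```sh
# ascii_sort   (hypothetical helper, not part of XYLEM)
# Sort lines from standard input by byte (ASCII) value.
ascii_sort() {
    LC_ALL=C sort
}
```

For example, "printf '%s\n' TOBPR1A1 PUMPR13 TOBPR1AR | ascii_sort"
lists PUMPR13 first.
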
(Note for remote users only: -f will only work for a single
name or accession number supplied in option 1. -f IS NOT ENABLED
FOR NAMEFILES specified in option 2.)

Batch mode:
Although it is transparent to the user, all fetch really does
is call getloc, saving the user the trouble of knowing which
database files to retrieve sequences from, or of having to
execute getloc multiple times to retrieve sequences from
different database files. Thus, the options are identical to those
for getloc:

    -a    Write annotation portions of entries only, terminated by '//'.
    -s    Write sequence data only, in Pearson (.wrp) format.
    -f    Write each entry to a separate file.
    -g    GenBank (default)
    -e    EMBL {not implemented}
    -p    PIR (NBRF)
    -v    VecBase
    -l    LiMB
    -G    GenBank_dataset
    -P    PIR_dataset

If -f is not specified, outfile must be specified.

    -L    force execution of fetch on the local host even if
          $XYLEM_RHOST is set. See "REMOTE EXECUTION" below.


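The rule that an output file is required unless -f is given can be
checked in a wrapper script before fetch is called. The following
POSIX sh sketch is an illustration only: check_fetch_args is a
hypothetical name, and the sketch assumes that -G and -P are each
followed by a dataset name, as in the synopsis and examples in this
document.

```sh
# check_fetch_args ARGS...   (hypothetical helper, not part of XYLEM)
# Return 0 if ARGS look like a valid batch-mode command line:
#   with -f:     at least a namefile
#   without -f:  a namefile and an output file
check_fetch_args() {
    separate=no
    positional=0
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -f)    separate=yes ;;
            -G|-P) shift ;;            # skip the dataset name that follows
            -*)    ;;                  # any other option flag
            *)     positional=$((positional + 1)) ;;
        esac
        [ "$#" -gt 0 ] && shift
    done
    if [ "$separate" = yes ]; then
        [ "$positional" -ge 1 ]
    else
        [ "$positional" -ge 2 ]
    fi
}
```

A wrapper might then run
"check_fetch_args "$@" && fetch "$@"" and print a usage message
otherwise.
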
PIR_dataset
GenBank_dataset
This can be either a file of PIR entries, a file of GenBank entries,
or a XYLEM dataset created by splitdb. A file of PIR entries must
have the file extension ".pir". A file of GenBank entries must have
the file extension ".gen". A XYLEM dataset contains PIR or GenBank
entries split among three files by splitdb: annotation (.ano),
sequence (.wrp) and index (.ind). These file extensions must be used!

When specifying a split dataset, only the base name needs to be
used. For example, given a XYLEM dataset consisting of the files
myset.ano, myset.wrp and myset.ind, the following two commands
are equivalent:

    fetch -P myset something.nam something.pir
    fetch -P myset.ano something.nam something.pir

If the original .pir file had been used, the command would have
been

    fetch -P myset.pir something.nam something.pir

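The naming rules for datasets can be expressed as a short function
that lists the files a given dataset argument refers to. This is a
sketch in POSIX sh under the rules stated above; dataset_files is a
hypothetical name, and the function only manipulates names and does
not call fetch or splitdb.

```sh
# dataset_files NAME   (hypothetical helper, not part of XYLEM)
# Print the file or files that a -G or -P dataset argument refers to:
#   NAME.pir or NAME.gen      an unsplit file, used as is
#   NAME or NAME.ano/.wrp/.ind  a split dataset: NAME.ano, NAME.wrp, NAME.ind
dataset_files() {
    case "$1" in
        *.pir|*.gen)       echo "$1"; return 0 ;;
        *.ano|*.wrp|*.ind) base=${1%.*} ;;
        *)                 base=$1 ;;
    esac
    for ext in ano wrp ind; do
        echo "$base.$ext"
    done
}
```

A quick check that a split dataset is complete could then be written
as: for f in $(dataset_files myset); do [ -r "$f" ] || echo "missing: $f"; done
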
The ability to work directly with .gen or .pir files is quite
convenient. However, since FETCH needs to work with a split
dataset, FETCH automatically splits .pir or .gen files into .ano,
.wrp and .ind files, which are removed when finished. This requires
extra disk space and execution time, which could be significant
for large datasets.

EXAMPLES
Batch example:

    fetch -f chitinase.nam

will retrieve annotation and sequence for sequences listed in
chitinase.nam from GenBank, writing each entry to a separate file
with the extension .gen.

    fetch -s -v pbr.nam pbr.wrp

will retrieve sequence data only for the entries listed in pbr.nam,
from VecBase, and write all sequences to a Pearson format file
(i.e., readable by fasta) with the name pbr.wrp.

    fetch -G sample sample.nam new.gen
    fetch -G sample.ano sample.nam new.gen

Both commands assume that a set of GenBank entries has been split by
splitdb into sample.ano, sample.wrp and sample.ind. The entries listed
in sample.nam are written to new.gen.


FILES
Database files:
The directories for database files are specified by the environment
variables $GB (GenBank), $PIR (PIR/NBRF), $VEC (VecBase) and $LIMB
(LiMB).

The index file for GenBank is $GB/gbacc.idx (this file is supplied
with each GenBank release), while the other databases
use .ind files generated by splitdb. Split database files MUST
have the following file extensions: .ano {annotation}, .wrp
{sequence} and .ind {index}. Thus, when creating database files
for pir1.dat with splitdb, the output files should be pir1.ano,
pir1.wrp and pir1.ind.

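Since fetch depends on the environment variables named above, it can
be useful to confirm that they point at real directories before a long
batch run. The POSIX sh sketch below is an assumption about how one
might do that; check_db_dirs is a hypothetical name, and a site that
does not install all four databases would want to trim the list.

```sh
# check_db_dirs   (hypothetical helper, not part of XYLEM)
# Report each database variable that is unset or not a directory.
# Returns 0 only if $GB, $PIR, $VEC and $LIMB are all usable.
check_db_dirs() {
    status=0
    for var in GB PIR VEC LIMB; do
        eval "dir=\${$var:-}"          # look up the variable named in $var
        if [ -z "$dir" ]; then
            echo "\$$var is not set" >&2
            status=1
        elif [ ! -d "$dir" ]; then
            echo "\$$var=$dir is not a directory" >&2
            status=1
        fi
    done
    return $status
}
```
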
Temporary files:
    NAMEFILE.fetch
    PRELIMINARY.fetch
    TMP.fetch
    FOUND.fetch
    FETCHDIR {temporary directory}

REMOTE EXECUTION
Where the databases cannot be stored locally, FETCH can call
FETCH on another system and retrieve the results. To run
FETCH remotely, your .cshrc file should contain the following
lines:

    setenv XYLEM_RHOST remotehostname
    setenv XYLEM_USERID remoteuserid

where remotehostname is the name of the host on which the
databases reside (in XYLEM split format) and remoteuserid
is your userid on the remote system. When run remotely,
your local copy of FETCH will generate the following
commands:

    rcp filename $XYLEM_USERID@$XYLEM_RHOST:filename
    rsh $XYLEM_RHOST -l $XYLEM_USERID fetch ...
    rcp $XYLEM_USERID@$XYLEM_RHOST:outputfilename outputfilename
    rsh $XYLEM_RHOST -l $XYLEM_USERID $RM temporary_files

Because FETCH uses rsh and rcp, your home directory on both
the local and remote systems must have a world-readable
file called .rhosts, containing the names of trusted remote
hosts and your userid on each host. Before trying to get
FETCH to work remotely, make sure that you can rcp and
rsh to the remote host.

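Before testing rcp and rsh by hand, it may help to confirm that both
variables are actually set in the current environment. The following
POSIX sh sketch is one way to do that; xylem_remote_settings is a
hypothetical name, and the function only inspects the environment and
does not contact the remote host.

```sh
# xylem_remote_settings   (hypothetical helper, not part of XYLEM)
# Print the remote login in userid@host form, or fail with a message
# if either XYLEM_RHOST or XYLEM_USERID is missing.
xylem_remote_settings() {
    if [ -z "${XYLEM_RHOST:-}" ]; then
        echo "XYLEM_RHOST is not set" >&2
        return 1
    fi
    if [ -z "${XYLEM_USERID:-}" ]; then
        echo "XYLEM_USERID is not set" >&2
        return 1
    fi
    echo "$XYLEM_USERID@$XYLEM_RHOST"
}
```

If this succeeds, a command such as
"rsh $XYLEM_RHOST -l $XYLEM_USERID echo ok" should print "ok" without
asking for a password when .rhosts is set up correctly.
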
Remote execution of FETCH requires that FETCH already be
installed on the remote host. When FETCH runs another copy of
FETCH remotely, it uses the -L option (fetch -L) to ensure that
the remote FETCH job executes on that host, rather than calling
yet another FETCH on another host.


---------- Remote execution on more than 1 host -----------
If more than 1 remote host is available for running FETCH
(say, in a clustered environment where many servers mount
a common filesystem) the choice of a host can be determined
by the csh script choosehost, such that execution of
choosehost returns the name of a remote server. To use this
approach, the following script, called 'choosehost', should
be in your bin directory:

#!/bin/csh
# choosehost - choose a host to use for a remote job.
# This script rotates among servers listed in .rexhosts,
# by choosing the host at the top of the list and moving
# it to the bottom.

# Rotate the list, putting the current host at the bottom.
# (head -n 1 and tail -n +2 are the POSIX spellings of the
# older head -1 and tail +2.)
set HOST = `head -n 1 $home/.rexhosts`
set JOBID = $$
tail -n +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
echo $HOST >> /tmp/.rexhosts.$JOBID
/usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts

# Write out the current host name
echo $HOST

You must also have a file in your home directory called
.rexhosts, listing remote hosts, such as

    graucho.cc.umanitoba.ca
    harpo.cc.umanitoba.ca
    chico.cc.umanitoba.ca
    zeppo.cc.umanitoba.ca

Each time choosehost is called, it will rotate the
names in the file. For example, starting with the .rexhosts
as shown, it will move graucho.cc.umanitoba.ca to the bottom
of the file, and write the line 'graucho.cc.umanitoba.ca'
to the standard output. The next time choosehost is
run, it would write 'harpo.cc.umanitoba.ca', and so on.

Depending on your local configuration, you may wish to
rewrite choosehost. All that is really necessary is that
echo `choosehost` should return the name of a valid host.

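As one example of such a rewrite, the rotation can be expressed as a
POSIX sh function that takes the host list as an argument, which makes
it easy to try out on a scratch copy of the list. This is a sketch
only: rotate_hosts is a hypothetical name, and mktemp is assumed to be
available (it is common but not required by POSIX).

```sh
# rotate_hosts FILE   (hypothetical helper, not part of XYLEM)
# Print the first host listed in FILE and move that line to the
# end of FILE, so that the next call prints the following host.
rotate_hosts() {
    list=$1
    host=$(head -n 1 "$list")
    tmp=$(mktemp) || return 1
    tail -n +2 "$list" > "$tmp"     # every line except the first
    echo "$host" >> "$tmp"          # the first line goes last
    mv "$tmp" "$list"
    echo "$host"
}
```

A choosehost script built on this would simply call
rotate_hosts "$HOME/.rexhosts".
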
Once you have installed choosehost and tested it, you can
get FETCH to use choosehost simply by setting

    setenv XYLEM_RHOST choosehost

in your .cshrc file.

--------------- Remote filesystems -----------------------
Finally, an alternative to remote execution is to remotely mount
the file system containing the databases across the network.
This has the advantage of simplicity, and means that the
databases are available for ALL programs on your local
workstation. However, it may still be advantageous to run
FETCH remotely, since that will shift much of the computational
load to another host.

BUGS
When retrieving entries directly from GenBank, FETCH uses the
Accession Number index file gbacc.idx. In this case, FETCH
can retrieve all entries containing a given accession number.
This capability makes it possible to retrieve an entry using a
secondary accession number. However, if more than one entry
shares a secondary accession number, all of those entries will
be retrieved. While this behavior might be a bit of an
annoyance at times, it can also be useful because it alerts
the user to the presence of other, related entries that might
be of interest.

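One way to notice that a secondary accession number has pulled in
extra entries is to count the entries in the output file and compare
that with the number of names requested. The POSIX sh sketch below
relies on the fact that each GenBank entry begins with a LOCUS line;
count_entries is a hypothetical name, and the check only applies when
annotation was retrieved (that is, not with -s).

```sh
# count_entries FILE   (hypothetical helper, not part of XYLEM)
# Count the GenBank entries in FILE by counting lines that
# begin with LOCUS.
count_entries() {
    grep -c '^LOCUS' "$1"
}
```

If count_entries reports more entries than there were names in the
namefile, at least one name matched several entries.
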
SEE ALSO
getloc, features

AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca

REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.