FETCH.DOC                                         update 24 Feb 96

NAME
     fetch - retrieves database entries by name or accession number

SYNOPSIS
     fetch                                    {interactive mode}
     fetch [options] namefile [output file]   {batch mode}

DESCRIPTION
     fetch retrieves one or more entries from a database.

     Interactive mode:

     fetch prompts the user to set search parameters, using an
     interactive menu:

 ___________________________________________________________________
 FETCH - Version 7 Feb 94
 Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
 ___________________________________________________________________
 Namefile:
 Outfile:
 Database:
 -------------------------------------------------------------------
    Parameter    Description                                  Value
 1) Name/Acc     Name or Accession sequence to get
 2) Namefile     Get list of sequences from Namefile
 3) WhatToGet    a:annotation s:sequence b:both               b
 4) Database     g:GenBank p:PIR v:VecBase l:LiMB             g
                 G:GenBank dataset P:PIR dataset
 5) Outfile      Send all output to a single file (Outfile)
 6) Files        f:Send each entry to a separate file         f
 -------------------------------------------------------------
 Type number of your choice or 0 to continue:

     After all parameters have been set, type 0 to commence the
     search. Messages regarding the progress of the search will be
     printed.

     (1,2) Which entries to get?
     If you want to get a single entry, option 1 lets you type in
     the name of that entry, without having to create a namefile.
     To get more than one entry, choose option 2, and specify the
     name of a file containing sequence names or accession numbers.

     namefile is a file containing one or more sequence names or
     accession numbers, each on a separate line. Names and accession
     numbers can even be interspersed, in upper or lowercase, and in
     any order. For example, the namefile prp.nam might contain

          ; plant pathogenesis related proteins
          ; (these are sample comment lines)
          ; note that any line containing a semicolon is ignored
          x06362
          x05454
          TOBPR1A1
          ; comments can be interspersed with names.
          PUMPR13
          tobpr1ar

     Options 1 & 2 are mutually exclusive. Setting one will negate
     the other. If option 2 is chosen, the name of the namefile will
     appear at the top of the menu.

     (3) WhatToGet
     Use this option to specify whether to get annotation, sequence,
     or both (default=both).

     (4) Database
     Use this option to select the database (default=GenBank).
     G and P select user-created database subsets containing GenBank
     or PIR entries, respectively. It is assumed that the database
     has been split into .ano, .wrp and .ind files using splitdb.
     For example, if you had created a database subset called
     PR1.pir, splitdb would create PR1.ano, PR1.wrp and PR1.ind.
     These are the files actually read by FETCH. When prompted for
     the name of the database, simply type "PR1", without a file
     extension. (If you do type a file extension, it will be
     ignored.)

     (5, 6) Where to send output
     By default, option 6 is set to f, and each entry will be
     written to a separate file, where the name of the file is the
     name of the entry, followed by a file extension. If a complete
     entry is retrieved, the file extension will indicate the type
     of database (GenBank: .gen; PIR: .pir; Vecbase: .vec;
     LiMB: .LiMB). If only annotation or sequence is retrieved, the
     file extension will be .ano or .wrp, respectively. Using the
     default, the namefile above would create the following files:

          PUMPR13.gen
          TOBPR1A1.gen
          TOBPR1AR.gen
          TOBPR1CR.gen
          TOBPR1PS.gen

     By choosing option 5, you can specify the name of a single
     output file for all entries to go to. This filename will appear
     at the top of the menu. Options 5 & 6 are mutually exclusive.

     Note that entries retrieved are written in alphabetical order
     (sorting by ASCII values), not the order in which they appeared
     in namefile.

     (Note for remote users only: -f will only work for a single
     name/accession supplied in option 1. -f IS NOT ENABLED FOR
     NAMEFILES specified in option 2.)

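The file-naming and ordering rules for option 6 can be illustrated with a short Python sketch. This is not part of FETCH itself; the function names (output_filename, write_order) and the table names are hypothetical, and only the four standard databases (g, p, v, l) are covered because the text above does not state the extensions used for G and P datasets.

```python
# Illustrative sketch only; FETCH itself is not written in Python.
# Extension tables are taken from the description of options 5 and 6.
FULL_ENTRY_EXT = {"g": ".gen", "p": ".pir", "v": ".vec", "l": ".LiMB"}
PARTIAL_EXT = {"a": ".ano", "s": ".wrp"}


def output_filename(entry_name, database="g", what="b"):
    """Return the per-entry file name used when each entry gets its own file.

    database: g, p, v or l (menu option 4)
    what:     a (annotation), s (sequence) or b (both) (menu option 3)
    """
    if what == "b":
        # A complete entry takes the extension of its database.
        ext = FULL_ENTRY_EXT[database]
    else:
        # Annotation-only or sequence-only output.
        ext = PARTIAL_EXT[what]
    return entry_name + ext


def write_order(entry_names):
    """Entries are written sorted by ASCII value, not in namefile order."""
    return sorted(entry_names)


# Entry names as they appear in the file listing above.
names = ["TOBPR1CR", "TOBPR1PS", "TOBPR1A1", "PUMPR13", "TOBPR1AR"]
files = [output_filename(n) for n in write_order(names)]
print("\n".join(files))
```

With the default settings (GenBank, both annotation and sequence), the sketch lists the five .gen files in the same order as the listing above.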
     Batch mode:

     Although it is transparent to the user, all fetch really does
     is call getloc, saving the user the trouble of knowing which
     database files to retrieve sequences from, or of having to
     execute getloc multiple times to retrieve sequences from
     different database files. Thus, the options are identical to
     those for getloc:

     -a   Write annotation portions of entries only, terminated
          by '//'.
     -s   Write sequence data only, in Pearson (.wrp) format.
     -f   Write each entry to a separate file.
     -g   GenBank (default)
     -e   EMBL {not implemented}
     -p   PIR (NBRF)
     -v   Vecbase
     -l   LiMB
     -G GenBank_dataset
     -P PIR_dataset

     If -f is not specified, outfile must be specified.

     -L   Force execution of fetch on the local host even if
          $XYLEM_RHOST is set. See "REMOTE EXECUTION" below.

     PIR_dataset
     GenBank_dataset
          This can be either a file of PIR entries, a file of
          GenBank entries, or a XYLEM dataset created by splitdb.
          A file of PIR entries must have the file extension
          ".pir". A file of GenBank entries must have the file
          extension ".gen". A XYLEM dataset contains PIR or GenBank
          entries split among three files by splitdb: annotation
          (.ano), sequence (.wrp) and index (.ind). These file
          extensions must be used!

          When specifying a split dataset, only the base name needs
          to be used. For example, given a XYLEM dataset consisting
          of the files myset.ano, myset.wrp and myset.ind, the
          following two commands are equivalent:

               fetch -P myset something.nam something.pir
               fetch -P myset.ano something.nam something.pir

          If the original .pir file had been used, the command
          would have been

               fetch -P myset.pir something.nam something.pir

          The ability to work directly with .gen or .pir files is
          quite convenient. However, since FETCH needs to work with
          a split dataset, it automatically splits .pir or .gen
          files into .ano, .wrp and .ind files, which are removed
          when finished. This requires extra disk space and
          execution time, which could be significant for large
          datasets.

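The dataset naming rules for -G and -P can be summarized in a small Python sketch. It is an illustration under stated assumptions, not the logic FETCH actually uses: the function name resolve_dataset and its return convention are hypothetical, and a name with an unrecognized extension is simply treated as a base name.

```python
import os

# Extensions that splitdb gives to the three files of a XYLEM dataset.
SPLIT_EXTS = (".ano", ".wrp", ".ind")


def resolve_dataset(name, flat_ext):
    """Map a -G/-P argument to (base_name, needs_split).

    flat_ext is ".gen" for -G or ".pir" for -P (hypothetical parameter).
    needs_split is True when the argument is an unsplit file that would
    first have to be split into .ano, .wrp and .ind files.
    """
    base, ext = os.path.splitext(name)
    if ext == flat_ext:
        # Unsplit .gen or .pir file: temporary split files are required.
        return base, True
    if ext in SPLIT_EXTS:
        # One file of an existing split dataset: only the base name matters.
        return base, False
    # No recognized extension: assume the argument is already the base name.
    return name, False


def split_files(base):
    """Names of the three files that make up a split dataset."""
    return [base + ext for ext in SPLIT_EXTS]


for arg in ("myset", "myset.ano", "myset.pir"):
    print(arg, resolve_dataset(arg, ".pir"))
```

In this sketch, "myset" and "myset.ano" resolve to the same base name with no splitting needed, which mirrors the two equivalent commands above, while "myset.pir" is flagged as needing the extra split step.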
EXAMPLES
     Batch examples:

          fetch -f chitinase.nam

     will retrieve annotation and sequence for sequences listed in
     chitinase.nam from GenBank, writing each entry to a separate
     file with the extension .gen.

          fetch -s -v pbr.nam pbr.wrp

     will retrieve sequence data only for the entries listed in
     pbr.nam, from VecBase, and write all sequences to a Pearson
     format file (i.e. readable by fasta) with the name pbr.wrp.

          fetch -G sample sample.nam new.gen
          fetch -G sample.ano sample.nam new.gen

     Both commands assume that a set of GenBank entries has been
     split by splitdb into sample.ano, sample.wrp and sample.ind.
     The entries listed in sample.nam are written to new.gen.

FILES
     Database files:
     The directories for database files are specified by the
     environment variables $GB (GenBank), $PIR (PIR/NBRF),
     $VEC (Vecbase) and $LIMB (LiMB). The index file for GenBank is
     $GB/gbacc.idx (this file is supplied with each GenBank
     release), while the other databases use .ind files generated
     by splitdb.

     Split database files MUST have the following file extensions:
     .ano {annotation}, .wrp {sequence} and .ind {index}. Thus, when
     creating database files for pir1.dat with splitdb, the output
     files should be pir1.ano, pir1.wrp and pir1.ind.

     Temporary files:
          NAMEFILE.fetch
          PRELIMINARY.fetch
          TMP.fetch
          FOUND.fetch
          FETCHDIR {temporary directory}

REMOTE EXECUTION
     Where the databases cannot be stored locally, FETCH can call
     FETCH on another system and retrieve the results. To run FETCH
     remotely, your .cshrc file should contain the following lines:

          setenv XYLEM_RHOST remotehostname
          setenv XYLEM_USERID remoteuserid

     where remotehostname is the name of the host on which the
     databases reside (in XYLEM split format) and remoteuserid is
     your userid on the remote system. When run remotely, your
     local copy of FETCH will generate the following commands:

          rcp filename $XYLEM_USERID@$XYLEM_HOST:filename
          rsh $XYLEM_RHOST -l $XYLEM_USERID fetch ...
          rcp $XYLEM_USERID@$XYLEM_HOST:outputfilename outputfilename
          rsh $XYLEM_RHOST -l $XYLEM_USERID $RM temporary_files

     Because FETCH uses rsh and rcp, your home directory on both the
     local and remote systems must have a world-readable file called
     .rhosts, containing the names of trusted remote hosts and your
     userid on each host. Before trying to get FETCH to work
     remotely, make sure that you can rcp and rsh to the remote
     host.

     Remote execution of FETCH requires that FETCH already be
     installed on the remote host. When FETCH runs another copy of
     FETCH remotely, it uses the -L option (fetch -L) to ensure that
     the remote FETCH job executes, rather than calling yet another
     FETCH on another host.

     ---------- Remote execution on more than 1 host -----------
     If more than 1 remote host is available for running FETCH (say,
     in a clustered environment where many servers mount a common
     filesystem), the choice of a host can be determined by the csh
     script choosehost, such that execution of choosehost returns
     the name of a remote server. To use this approach, the
     following script, called 'choosehost', should be in your bin
     directory:

          #!/bin/csh
          # choosehost - choose a host to use for a remote job.
          # This script rotates among servers listed in .rexhosts,
          # by choosing the host at the top of the list and moving
          # it to the bottom.

          #Rotate the list, putting the current host to the bottom.
          set HOST = `head -1 $home/.rexhosts`
          set JOBID = $$
          tail +2 $home/.rexhosts > /tmp/.rexhosts.$JOBID
          echo $HOST >> /tmp/.rexhosts.$JOBID
          /usr/bin/mv /tmp/.rexhosts.$JOBID $home/.rexhosts

          # Write out the current host name
          echo $HOST

     You must also have a file in your home directory called
     .rexhosts, listing remote hosts, such as

          graucho.cc.umanitoba.ca
          harpo.cc.umanitoba.ca
          chico.cc.umanitoba.ca
          zeppo.cc.umanitoba.ca

     Each time choosehost is called, it will rotate the names in the
     file.
     For example, starting with the .rexhosts as shown, it will move
     graucho.cc.umanitoba.ca to the bottom of the file, and write
     the line 'graucho.cc.umanitoba.ca' to the standard output. The
     next time choosehost is run, it would write
     'harpo.cc.umanitoba.ca', and so on.

     Depending on your local configuration, you may wish to rewrite
     choosehost. All that is really necessary is that

          echo `choosehost`

     should return the name of a valid host.

     Once you have installed choosehost and tested it, you can get
     FETCH to use choosehost simply by setting

          setenv XYLEM_RHOST choosehost

     in your .cshrc file.

     --------------- Remote filesystems -----------------------
     Finally, an alternative to remote execution is to remotely
     mount the file system containing the databases across the
     network. This has the advantage of simplicity, and means that
     the databases are available for ALL programs on your local
     workstation. However, it may still be advantageous to run FETCH
     remotely, since that will shift much of the computational load
     to another host.

BUGS
     When retrieving entries directly from GenBank, FETCH uses the
     Accession Number index file gbacc.idx. In this case, FETCH can
     retrieve all entries containing a given accession number. This
     capability makes it possible to retrieve an entry using a
     secondary accession number. However, if more than one entry
     shares a secondary accession number, all of those entries will
     be retrieved. While this behavior might be a bit of an
     annoyance at times, it can also be useful because it alerts the
     user to the presence of other, related entries that might be of
     interest.

SEE ALSO
     getloc features

AUTHOR
     Dr. Brian Fristensky
     Dept. of Plant Science
     University of Manitoba
     Winnipeg, MB Canada R3T 2N2
     Phone: 204-474-6085
     FAX: 204-261-5732
     frist@cc.umanitoba.ca

REFERENCE
     Fristensky, B. (1993) Feature expressions: creating and
     manipulating sequence datasets. Nucleic Acids Research
     21:5997-6003.