GETOB 21 Dec 94
NAME
getob - Get an object from GenBank
SYNOPSIS
getob [-frcn] infile namefile anofile seqfile indfile message
[outfile] expfile
DESCRIPTION
getob extracts 'objects' (subsequences) from GenBank entries, using
the features table, and writes them to outfile (.out). A log
describing the construction of each object is written to message
(.msg). If -r is not set, a list of expressions that would recreate
the .out file, if evaluated by getob -r, is written to expfile (.exp).
The following options are available:
f Write each entry to a separate file. The name will consist
of the entry name, and the extension '.obj'.
r Resolve expressions from namefile into objects.
Expressions take the form:
@[<database>::]<accession>:<location>
In effect, r makes it possible to use getob to resolve
features that span more than one entry, such as segmented
files. In the first run of the program, features that require
data from outside the entry in which they are defined will be
written to outfile with those externally-defined parts represented
using the '@' notation described above. During a
subsequent run, the outfile from the previous run is used as
namefile. When r is set, all lines not beginning with '@' (i.e.
name lines and sequence lines) are simply copied to the new
outfile. When an '@' is encountered, the expression is parsed
into accession number and location. The entry with the
specified accession number is located in indfile, and read from
anofile and seqfile. It is then evaluated, and the result
written to outfile in place of the '@' expression.
getob can also be used to get specific labeled objects from
a given entry. Examples:
@k30576:polyprotein
@k30576:/label=polyprotein
@x10345:/product="hsp70"
@j00879:group(1..2200,mutation_37)
The first two constructs given above are equivalent. Both
will extract the feature called polyprotein. The third
construct shows that any feature label can be specified. If
none is specified, as in the first example, then /label= is
assumed. One limitation, however, is that the label sought
must be unique within the entry in its first 15 characters
including double quotes ("). Otherwise, only the first
matching label expression will be evaluated. Finally, the
last example shows that a mutant sequence can be constructed
by first specifying an expression that evaluates to a
sequence (e.g. 1..2200) and then a labeled expression that,
upon evaluation, uses replace() to modify that sequence. The
usage shown in examples 3 & 4 above represents an extension to
the DDBJ/EMBL/GenBank Features Table Format.
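The '@' notation described above can be illustrated with a short Python
sketch. This is not getob's own parser (getob is written in Pascal); the
names ObjectRef and parse_expression are hypothetical, and the rule used
to recognize a bare label (a single identifier-like word) is an
assumption made for the example.

```python
import re
from typing import NamedTuple, Optional


class ObjectRef(NamedTuple):
    database: Optional[str]  # optional "<database>::" prefix, else None
    accession: str
    location: str            # location expression or /qualifier=value


# '@', optional '<database>::', accession, ':', then the rest of the line.
_REF = re.compile(r"^@(?:(?P<db>[^:]+)::)?(?P<acc>[^:]+):(?P<loc>.+)$")

# Assumption: a bare label is one word with no operators or coordinates.
_BARE_LABEL = re.compile(r"[A-Za-z_][\w'-]*")


def parse_expression(line: str) -> ObjectRef:
    """Split an '@' expression into database, accession and location."""
    m = _REF.match(line.strip())
    if m is None:
        raise ValueError(f"not an '@' expression: {line!r}")
    loc = m.group("loc")
    # A bare feature name is shorthand for /label=<name>.
    if _BARE_LABEL.fullmatch(loc):
        loc = "/label=" + loc
    return ObjectRef(m.group("db"), m.group("acc"), loc)
```

With this sketch, '@k30576:polyprotein' and '@k30576:/label=polyprotein'
parse to the same ObjectRef, while 'group(1..2200,mutation_37)' is left
untouched for a later evaluation step.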
As touched on briefly above, the r option makes it possible
to construct objects that include recursive references to
other entries (e.g. segmented files) by iterative calls to
getob. The 'features' command automates this process. The basic
algorithm is as follows:
getob infile namefile anofile seqfile indfile ...
#Pull out all lines containing indirect references
grep '@' outfile > unresolved.grep
while (unresolved.grep is not empty)
#extract accession numbers to be retrieved
cut -c2-7 unresolved.grep > unresolved.nam
#retrieve the sequences into a new file, and create
#a database subset to be used by getob
fetch unresolved.nam new.gen
splitdb new.gen new.ano new.wrp new.ind
#run getob again to resolve indirect references
getob -r infile outfile new.ano new.wrp new.ind ...
#Pull out all lines containing indirect references
grep '@' outfile > unresolved.grep
end
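The control flow of the loop above can also be sketched in Python. This
is only a model of the iteration, not the 'features' script itself: the
fetch/splitdb/getob -r steps are stood in for by a caller-supplied
function (resolve_pass, a hypothetical name), and the max_rounds guard
is an addition that the algorithm above does not have.

```python
from typing import Callable, List


def unresolved(lines: List[str]) -> List[str]:
    """Lines that still hold an indirect reference (the grep '@' step).

    Assumption: references occupy whole lines beginning with '@'.
    """
    return [ln for ln in lines if ln.startswith("@")]


def accessions(refs: List[str]) -> List[str]:
    """Mimic 'cut -c2-7': characters 2 to 7 of each reference line."""
    return [ref[1:7] for ref in refs]


def resolve_all(
    lines: List[str],
    resolve_pass: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 10,
) -> List[str]:
    """Repeat resolution passes until no '@' lines remain."""
    rounds = 0
    while unresolved(lines):
        if rounds == max_rounds:
            raise RuntimeError("references still unresolved")
        # One pass stands in for: fetch, splitdb, getob -r.
        lines = resolve_pass(lines, accessions(unresolved(lines)))
        rounds += 1
    return lines
```

Each pass may itself introduce new '@' lines (an entry that refers to
yet another entry), which is why the test is repeated after every pass
rather than run once.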
c namefile contains accession numbers, rather than LOCUS names.
n By default, the qualifier 'codon_start' is used to determine
how many n's, if necessary, must be added to the 5' end of
CDS, mat_peptide, or sig_peptide, to preserve the reading
frame. To turn OFF this feature, -n must be set. -n must be set
for GenBank Releases 67.0 and earlier.
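The padding rule can be pictured with the following Python sketch. It
assumes the usual meaning of /codon_start (the position, 1 to 3, of the
first base of the first complete codon within the feature) and assumes
that the n's are chosen so that the partial leading codon is filled out
to three bases. The function name pad_to_frame is hypothetical, and
getob's Pascal source may handle details differently.

```python
def pad_to_frame(seq: str, codon_start: int = 1) -> str:
    """Prefix seq with n's so that its length-3 windows are in frame.

    codon_start=1: no padding (sequence already starts on a codon).
    codon_start=2: one leftover base at the 5' end, so add two n's.
    codon_start=3: two leftover bases at the 5' end, so add one n.
    """
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    n_pad = (4 - codon_start) % 3
    return "n" * n_pad + seq
```

Setting -n corresponds to skipping this step and writing the sequence
exactly as located.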
infile contains commands indicating what data is to be pulled from
each entry. Two types of output may be presented, GenBank or
OBJECTS. These are described below:
1) GenBank output - If the word 'GENBANK' is the first line in
infile, a pseudo-GenBank entry will be recreated. This option
is only intended for debugging purposes and will probably be
removed in later releases.
2) Object format - This option instructs getob to write part or
all of each sequence, along with site annotation, by specifying
feature key names. The syntax for infile is shown below:
Backus-Naur format:          Example:
----------------------------------------------------------
OBJECTS                      OBJECTS
<feature key>                tRNA
{<feature key>               rRNA
 . . .                       SITES
<feature key>}               stem_loop
SITES
{<feature key>
 . . .
<feature key>}
In the example above, getob is instructed to extract all tRNA or
rRNA sequences from each entry, and annotate the position of each
stem/loop structure. Note that the SITES coordinates written to the
file give the positions of those SITES relative to the start of the
object, rather than their original locations in the sequence. As above,
each word begins a separate line.
While the -r option does not use infile, at least a dummy infile
must be included in the command line. This dummy file need only
contain two lines:
OBJECTS
SITES
NOTE: SITES IS NOT YET IMPLEMENTED! Although inclusion of SITES in
the input file will have no effect, the word SITES must still be
present after the last feature key.
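As an illustration of the infile syntax, the following Python sketch
reads the OBJECTS form into two lists of feature keys. It is a reading
aid only, not getob's parser; parse_infile is a hypothetical name, and
the GENBANK form is not handled.

```python
from typing import List, Tuple


def parse_infile(text: str) -> Tuple[List[str], List[str]]:
    """Return (object feature keys, site feature keys) from an infile."""
    # Each word begins a separate line; blank lines are ignored here.
    words = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not words or words[0] != "OBJECTS":
        raise ValueError("infile must begin with OBJECTS")
    if "SITES" not in words:
        raise ValueError("SITES must follow the last feature key")
    split = words.index("SITES")
    return words[1:split], words[split + 1:]
```

For the example above this yields the object keys tRNA and rRNA and the
site key stem_loop; the two-line dummy file used with -r yields two
empty lists.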
namefile
namefile consists of a list of LOCUS names or accession numbers,
each on a separate line. Names or accession numbers should appear
in the order in which they appear in the database file. Unordered
namefiles will slow the progress of the search. Since only the
first non-blank field of each line in namefile is read, indfile
could be used to create a namefile. If the entire indfile was
used, the entire database file would be processed. A sample
namefile requesting four sequences by LOCUS name is shown below:
POTPR1A
POTPSTH2
POTPSTH21
POTSTHA
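Because only the first non-blank field of each line is read, a namefile
may carry extra columns. A minimal Python sketch of that rule follows;
read_names is a hypothetical name, and skipping blank lines is an
assumption.

```python
from typing import List


def read_names(text: str) -> List[str]:
    """Return the first whitespace-delimited field of each line."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if fields:  # assumption: blank lines are skipped
            names.append(fields[0])
    return names
```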
anofile, seqfile, and indfile
The database subset containing GenBank entries must be divided
into annotation, sequence, and index files by splitdb.
message
message contains a log describing the parsing of each object.
For annotative purposes, qualifier lines from the object are
included along with the location expression being parsed.
The beginning of a typical message file is shown below:
GETOB Version 0.962 14 May 1992
POTPR1A:CDS1
join
(
295 603
1011 1355
)
/note="pathogenesis-related protein (prp1)"
/codon_start=1
/translation="MAEVKLLGLRYSPFSHRVEWALKIKGVKYEFIEEDLQNKSPLLL
QSNPIHKKIPVLIHNGKCICESMVILEYIDEAFEGPSILPKDPYDRALARFWAKYVED
KGAAVWKSFFSKGEEQEKAKEEAYEMLKILDNEFKDKKCFVGDKFGFADIVANGAALY
LGILEEVSGIVLATSEKFPNFCAWRDEYCTQNEEYFPSRDELLIRYRAYIQPVDASK"
//----------------------------------------------
In the example above, getob was instructed to retrieve all CDS
features from the database subset. The message for the entry
POTPR1A is shown, along with a reconstruction of the location
expression that was evaluated to create the object. In this
case, protein coding sequences from two exons had to be joined
to create the object.
outfile
outfile contains the actual objects constructed, consisting of
sites found and sequences. The beginning of a typical output file
is shown below:
>POTPR1A:CDS1
atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
gaagataag
ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
atccgttaccgagcctacattcagcctgttgatgcttcaaaatga
In the example, the CDS from entry POTPR1A has been written in
two chunks, corresponding to the two exon portions of the coding
sequence. Each location retrieved in constructing the object is
written as a separate block of sequence. By comparing message file
to outfile, it is possible to verify the correctness of the
operation.
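One simple check of this kind is to compare the length of each span in
the location expression with the length of the corresponding block of
sequence. The Python sketch below assumes a location made up only of
plain a..b spans; segment_lengths and blocks_match are hypothetical
names, not part of getob.

```python
import re
from typing import List


def segment_lengths(location: str) -> List[int]:
    """Length of each a..b span, in the order written."""
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return [int(end) - int(start) + 1 for start, end in spans]


def blocks_match(location: str, blocks: List[str]) -> bool:
    """True if each sequence block is as long as its span."""
    return segment_lengths(location) == [len(b) for b in blocks]
```

For join(295..603,1011..1355) the spans are 309 and 345 bases long,
which should equal the lengths of the two blocks written for
POTPR1A:CDS1.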
Numbers are appended to the sequence names to indicate
which CDS in the entry has been retrieved. Thus, if two CDS
features were present, the second one would be named >POTPR1A:CDS2.
For compatibility with the FASTA programs of Pearson, the name line
begins with a '>'.
expfile
The expression evaluated to create this feature is written
to expfile:
>POTPR1A:CDS1
@J03679:join(295..603,1011..1355)
expfile is only created if -r is not set. It is intended as a way
of automating the creation of a feature expression file for use
in generating customized datasets. Expressions in expfile can be
deleted or modified, or new expressions added, to tailor the
dataset to individual needs. To generate a dataset from expfile:
getob -r infile expfile anofile seqfile indfile message outfile
EXTENSIONS TO THE FEATURE TABLE LANGUAGE
1) poly(<absolute_location>|<literal>|<feature_name>,x)
This operator evaluates an absolute location, literal, or
feature name (ie. any location not containing functional
operators) and writes it x times. The most obvious
application of poly is to create spacers to represent regions
of unknown sequence between sequences that are known. For
example, the restriction map of a 4 kb EcoRI fragment with a
HindIII site 1000 bp from one end could be represented as follows:
join("gaattc",poly("n",1000),"aagctt",poly("n",3000),"gaattc")
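The effect of poly on literals can be mimicked in a few lines of Python.
This sketch covers string literals only (not absolute locations or
feature names), and the Python functions poly and join are stand-ins
named after the operators rather than getob code.

```python
def poly(unit: str, times: int) -> str:
    """Write the literal 'unit' the given number of times."""
    return unit * times


def join(*parts: str) -> str:
    """Concatenate already-evaluated parts in order."""
    return "".join(parts)


fragment = join("gaattc", poly("n", 1000), "aagctt",
                poly("n", 3000), "gaattc")
print(len(fragment))  # 6 + 1000 + 6 + 3000 + 6 = 4018
```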
2) The following feature keys are recognized by GETOB, although
not included in the language definition. While they will not
appear in GenBank entries, they could be used in user-created
GenBank-format files:
contig
This feature key is meant to be used to assemble large
sequence segments from smaller segments, possibly using the
poly() operator.
chromosome
Intended to annotate the complete sequence of a chromosome. This
feature may be constructed by a join of two or more contigs.
Use of these keywords is illustrated in the features table
shown below, which could be used to construct a model of part
of the E.coli chromosome, spanning map units 763.4 to 1031.4 kb:
contig join(J01619:1..13063,poly("n",7140),
J03939:1..1363,poly("n",14380),
X02306:complement(1..1622),poly("n",14710),
J04423:1..5793,poly("n",22500),
X03722:1..2400,poly("n",123750),
one-of(X05017:complement(1..1854),X05017:1..1854))
/label=Eco_contig8
/map=763.4-950.6kb
contig join(V00352:1..2412,poly("n",28800),M15273:1..3409)
/label=Eco_contig9
/map=972.9-1001.7kb
contig join(X02826:1..1357,poly("n",13540),
J01654:complement(1..2270))
/label=Eco_contig10
/map=1016.5-1031.4kb
chromosome join(Eco_contig8,poly("n",22300),
Eco_contig9,poly("n",14800),
Eco_contig10)
/label=Ecoli_chromosome
NOTES
1) If the const DEBUG is set to true in the Pascal source code, getob
writes messages to the standard output, indicating the progress of
processing for each entry read in. By default, DEBUG=false.
This feature is solely for debugging purposes and will be removed in
later releases.
2) GETOB automatically expands leading blanks that have been
compressed using splitdb -c. See splitdb.doc for more information.
SEE ALSO
features, splitdb, getloc
The DDBJ/EMBL/GenBank Feature Table: Definition, Version 1.04
September 1, 1992
GenBank Release Notes for Release 79.0.
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.