GETOB 21 Dec 94


NAME
getob - Get an object from GenBank

SYNOPSIS
getob [-frcn] infile namefile anofile seqfile indfile message
      [outfile] expfile

DESCRIPTION
getob extracts 'objects' (subsequences) from GenBank entries, using
the features table, and writes them to outfile (.out). A log
describing the construction of each object is written to message
(.msg). If -r is not set, a list of expressions that would recreate
the .out file if evaluated by getob -r is written to expfile (.exp).

The following options are available:

f    Write each entry to a separate file. The name will consist
     of the entry name and the extension '.obj'.

r    Resolve expressions from namefile into objects.
     Expressions take the form:

          @[<database>::]<accession>:<location>

     In effect, r makes it possible to use getob to resolve
     features that span more than one entry, such as segmented
     files. In the first run of the program, features that require
     data from outside the entry in which they are defined will be
     written to outfile with those externally-defined parts
     represented using the '@' notation described above. During a
     subsequent run, the outfile from the previous run is used as
     namefile. When r is set, all lines not beginning with '@' (i.e.
     name lines and sequence lines) are simply copied to the new
     outfile. When an '@' is encountered, the expression is parsed
     into accession number and location. The entry with the
     specified accession number is located in indfile, and read from
     anofile and seqfile. The location is then evaluated, and the
     result written to outfile in place of the '@' expression.

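As an illustration of the expression form above, the following Python
sketch splits an '@' expression into its database, accession and location
parts. It is not taken from the getob source (which is written in Pascal);
the names IndirectRef and parse_indirect_ref, and the database name
"genbank" used in the usage note, are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class IndirectRef(NamedTuple):
    """One '@[<database>::]<accession>:<location>' expression (hypothetical type)."""
    database: Optional[str]
    accession: str
    location: str


# '@', an optional '<database>::' prefix, '<accession>:', then the location.
_REF = re.compile(r"^@(?:(?P<db>[^:]+)::)?(?P<acc>[^:]+):(?P<loc>.+)$")


def parse_indirect_ref(line: str) -> Optional[IndirectRef]:
    """Return the parts of an '@' expression, or None for any other line."""
    match = _REF.match(line.strip())
    if match is None:
        return None
    return IndirectRef(match.group("db"), match.group("acc"), match.group("loc"))
```

For example, parse_indirect_ref("@J03679:join(295..603,1011..1355)") yields
accession "J03679" and location "join(295..603,1011..1355)", with no
database; a line such as "@genbank::J03679:1..10" would also fill in the
database field. Name lines and sequence lines return None.
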
     getob can also be used to get specific labeled objects from
     a given entry. Examples:

          @k30576:polyprotein
          @k30576:/label=polyprotein
          @x10345:/product="hsp70"
          @j00879:group(1..2200,mutation_37)

     The first two constructs given above are equivalent. Both
     will extract the feature called polyprotein. The third
     construct shows that any feature qualifier can be specified. If
     none is specified, as in the first example, then /label= is
     assumed. One limitation, however, is that the label sought
     must be unique within the entry in its first 15 characters,
     including double quotes ("). Otherwise, only the first
     matching label expression will be evaluated. Finally, the
     last example shows that a mutant sequence can be constructed
     by first specifying an expression that evaluates to a
     sequence (i.e. 1..2200) and then a labeled expression that,
     upon evaluation, uses replace() to modify that sequence. The
     usage shown in examples 3 and 4 above represents an extension
     to the DDBJ/EMBL/GenBank Features Table Format.

     As touched on briefly above, the r option makes it possible
     to construct objects that include recursive references to
     other entries (e.g. segmented files) by iterative calls to
     getob. The 'features' command automates this process. The basic
     algorithm is as follows:

          getob infile namefile anofile seqfile indfile ...

          #Pull out all lines containing indirect references
          grep '@' outfile > unresolved.grep

          while (unresolved.grep is not empty)

             #extract accession numbers to be retrieved
             cut -c2-7 unresolved.grep > unresolved.nam

             #retrieve the sequences into a new file, and create
             #a database subset to be used by getob
             fetch unresolved.nam new.gen
             splitdb new.gen new.ano new.wrp new.ind

             #run getob again to resolve indirect references
             getob -r infile outfile new.ano new.wrp new.ind ...

             #Pull out all lines containing indirect references
             grep '@' outfile > unresolved.grep
          end

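The "extract accession numbers" step of the loop above can be sketched in
Python as follows. This is only an illustration, not the 'features'
script itself: the function name unresolved_accessions is hypothetical,
and where the shell version takes a fixed six characters with cut -c2-7,
this sketch instead reads up to the first ':' and skips an optional
'<database>::' prefix.

```python
from typing import Iterable, List


def unresolved_accessions(lines: Iterable[str]) -> List[str]:
    """Collect accession numbers from lines that begin with '@'.

    Duplicates are dropped; first-seen order is kept.
    """
    found: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("@"):
            continue  # name lines and sequence lines are ignored
        body = line[1:]
        if "::" in body:
            # Drop an optional '<database>::' prefix.
            body = body.split("::", 1)[1]
        accession = body.split(":", 1)[0]
        if accession and accession not in found:
            found.append(accession)
    return found
```

An empty result corresponds to the loop's termination condition: no
indirect references remain in outfile.
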
c    namefile contains accession numbers, rather than locus names.

n    By default, the qualifier 'codon_start' is used to determine
     how many n's, if necessary, must be added to the 5' end of
     CDS, mat_peptide, or sig_peptide, to preserve the reading
     frame. To turn OFF this feature, -n must be set. -n must be set
     for GenBank Releases 67.0 and earlier.

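The exact padding rule is not spelled out above. The Python sketch below
shows one reading that preserves the frame, under the assumption that
codon_start=2 means one leading base belongs to an incomplete codon (so
two n's complete it) and codon_start=3 means two leading bases do (so one
n completes it). The function name pad_for_codon_start is hypothetical
and this is not the getob implementation.

```python
def pad_for_codon_start(seq: str, codon_start: int = 1, pad_char: str = "n") -> str:
    """Prefix seq so that the first complete codon falls on a codon boundary.

    Assumed rule: codon_start 1 -> no padding, 2 -> two pad characters,
    3 -> one pad character.
    """
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    n_pad = (4 - codon_start) % 3
    return pad_char * n_pad + seq
```

With this rule, a feature beginning "catgaaa" with codon_start=2 becomes
"nncatgaaa", which reads in frame as nnc atg aaa.
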
infile contains commands indicating what data is to be pulled from
each entry. Two types of output may be produced, GenBank or
OBJECTS. These are described below:

1) GenBank output - If the word 'GENBANK' is the first line in
infile, a pseudo-GenBank entry will be recreated. This option
is only intended for debugging purposes and will probably be
removed in later releases.

2) Object format - This option instructs getob to write part or
all of each sequence, along with site annotation, by specifying
feature key names. The syntax for infile is shown below:

     Backus-Naur format:          Example:
     ----------------------------------------------------------
     OBJECTS                      OBJECTS
     <feature key>                tRNA
     {<feature key>               rRNA
      . . .                       SITES
     <feature key>}               stem_loop
     SITES
     {<feature key>
      . . .
     <feature key>}

In the example above, getob is instructed to extract all tRNA or
rRNA sequences from each entry, and annotate the position of each
stem/loop structure. Note that the SITES coordinates written to the
file give the positions of those SITES relative to the start of the
object, rather than the original location in the sequence. As above,
each word begins a separate line.

While the -r option does not use infile, at least a dummy infile
must be included in the command line. This dummy file need only
contain two lines:

     OBJECTS
     SITES

NOTE: SITES IS NOT YET IMPLEMENTED! Although inclusion of SITES in
the input file will have no effect, the word SITES must still be
present after the last feature key.


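The infile layout described above (one word per line, OBJECTS followed by
feature keys, then SITES followed by feature keys) can be read with a few
lines of Python. This is a sketch for checking an infile before running
getob, not a description of getob's own parser; the function name
parse_infile and the returned dictionary layout are hypothetical.

```python
from typing import Dict, List, Union


def parse_infile(text: str) -> Dict[str, Union[str, List[str]]]:
    """Split infile text into its mode, OBJECTS keys and SITES keys."""
    words = [w for w in (line.strip() for line in text.splitlines()) if w]
    if not words:
        raise ValueError("infile is empty")
    if words[0] == "GENBANK":
        return {"mode": "GENBANK", "objects": [], "sites": []}
    if words[0] != "OBJECTS":
        raise ValueError("first line must be GENBANK or OBJECTS")
    if "SITES" not in words:
        # The word SITES is required even though it currently has no effect.
        raise ValueError("the word SITES must follow the last feature key")
    split_at = words.index("SITES")
    return {
        "mode": "OBJECTS",
        "objects": words[1:split_at],
        "sites": words[split_at + 1:],
    }
```

Applied to the example infile, this gives the object keys tRNA and rRNA
and the site key stem_loop; the two-line dummy file gives two empty lists.
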
namefile
namefile consists of a list of LOCUS names or accession numbers,
each on a separate line. Names or accession numbers should appear
in the order in which they appear in the database file. Unordered
namefiles will slow the progress of the search. Since only the
first non-blank field of each line in namefile is read, indfile
could be used to create a namefile. If the entire indfile were
used, the entire database file would be processed. A sample
namefile requesting four sequences by LOCUS name is shown below:

     POTPR1A
     POTPSTH2
     POTPSTH21
     POTSTHA

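The "first non-blank field" rule can be illustrated with a short Python
sketch. The function name read_names is hypothetical, and the extra
columns in the usage note stand in for whatever else an index line may
contain; the actual indfile layout is defined by splitdb and is not
reproduced here.

```python
from typing import Iterable, List


def read_names(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-delimited field of each non-blank line."""
    names: List[str] = []
    for line in lines:
        fields = line.split()
        if fields:
            names.append(fields[0])
    return names
```

Both a bare line such as "POTPR1A" and a line with trailing columns such
as "POTPR1A  J03679  1" contribute the single name POTPR1A.
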
anofile, seqfile, and indfile
The database subset containing GenBank entries must be divided
into annotation, sequence, and index files by splitdb.

message
message contains a log describing the parsing of each object.
For annotative purposes, qualifier lines from the object are
included along with the location expression being parsed.
The beginning of a typical message file is shown below:

     GETOB Version 0.962 14 May 1992

     POTPR1A:CDS1
     join
     (
     295 603

     1011 1355

     )


     /note="pathogenesis-related protein (prp1)"
     /codon_start=1
     /translation="MAEVKLLGLRYSPFSHRVEWALKIKGVKYEFIEEDLQNKSPLLL
     QSNPIHKKIPVLIHNGKCICESMVILEYIDEAFEGPSILPKDPYDRALARFWAKYVED
     KGAAVWKSFFSKGEEQEKAKEEAYEMLKILDNEFKDKKCFVGDKFGFADIVANGAALY
     LGILEEVSGIVLATSEKFPNFCAWRDEYCTQNEEYFPSRDELLIRYRAYIQPVDASK"
     //----------------------------------------------

In the example above, getob was instructed to retrieve all CDS
features from the database subset. The message for the entry
POTPR1A is shown, along with a reconstruction of the location
expression that was evaluated to create the object. In this
case, protein coding sequences from two exons had to be joined
to create the object.

outfile
outfile contains the actual objects constructed, consisting of
sites found and sequences. The beginning of a typical output file
is shown below:

     >POTPR1A:CDS1
     atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
     agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
     aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
     aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
     ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
     ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
     gaagataag
     ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
     agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
     aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
     gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
     tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
     aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
     atccgttaccgagcctacattcagcctgttgatgcttcaaaatga

In the example, the CDS from entry POTPR1A has been written in
two chunks, corresponding to the two exon portions of the coding
sequence. Each location retrieved in constructing the object is
written as a separate block of sequence. By comparing the message
file with outfile, it is possible to verify the correctness of the
operation.

Numbers are appended to the sequence names to indicate
which CDS in the entry has been retrieved. Thus, if two CDS
features were present, the second one would be named >POTPR1A:CDS2.
For compatibility with the FASTA programs of Pearson, the name line
begins with a '>'.

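Because each name line begins with '>' and each retrieved location is
written as its own block of lines, an outfile can be read back into
whole sequences by concatenating the lines under each name. The Python
sketch below does this; the function name read_objects is hypothetical,
and it assumes a fully resolved outfile (no remaining '@' lines).

```python
from typing import Dict, Iterable, List


def read_objects(lines: Iterable[str]) -> Dict[str, str]:
    """Map each object name (text after '>') to its concatenated sequence."""
    blocks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            blocks[current] = []
        elif current is not None:
            # Sequence line: append to the object most recently named.
            blocks[current].append(line)
    return {name: "".join(parts) for name, parts in blocks.items()}
```

The length of each returned sequence can then be compared with the
location expression recorded in the message file.
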
expfile
The expression evaluated to create this feature is written
to expfile:

     >POTPR1A:CDS1
     @J03679:join(295..603,1011..1355)

expfile is only created if -r is not set. It is intended as a way
of automating the creation of a feature expression file for use
in generating customized datasets. Expressions in expfile can be
deleted or modified, or new expressions added, to tailor the
dataset to individual needs. To generate a dataset from expfile:

     getob -r infile expfile anofile seqfile indfile message outfile

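When editing expressions by hand, it can help to know how long the
resulting object should be. The Python sketch below sums the lengths of
the plain start..end spans in a location; the function name
location_length is hypothetical, and the sketch deliberately ignores
literals, poly(), one-of() and labeled features, so it is only meaningful
for locations built from numeric spans.

```python
import re


def location_length(location: str) -> int:
    """Sum the lengths of all 'start..end' spans (inclusive coordinates)."""
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return sum(int(end) - int(start) + 1 for start, end in spans)
```

For join(295..603,1011..1355) the two spans contribute 309 and 345
bases, for a total of 654.
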
EXTENSIONS TO THE FEATURE TABLE LANGUAGE

1) poly(<absolute_location>|<literal>|<feature_name>,x)

This operator evaluates an absolute location, literal, or
feature name (i.e. any location not containing functional
operators) and writes it x times. The most obvious
application of poly is to create spacers to represent regions
of unknown sequence between sequences that are known. For
example, the restriction map of a 4 kb EcoRI fragment with a
HindIII site 1000 bp from one end could be represented as follows:

     join("gaattc",poly("n",1000),"aagctt",poly("n",3000),"gaattc")

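To make the effect of poly concrete, the Python sketch below expands a
join whose arguments are only quoted literals and poly(<literal>,x)
terms, as in the restriction-map example. It is an illustration under
that restriction, not getob's evaluator: the function name
expand_literal_join is hypothetical, and absolute locations and feature
names are not handled.

```python
import re

# Matches either poly("<literal>",<count>) or a bare "<literal>".
_TERM = re.compile(r'poly\("([a-z]+)",(\d+)\)|"([a-z]+)"')


def expand_literal_join(expression: str) -> str:
    """Expand join(...) of quoted literals and poly() terms into a sequence."""
    inner = expression.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    parts = []
    for match in _TERM.finditer(inner):
        if match.group(1) is not None:
            # poly term: repeat the literal the requested number of times.
            parts.append(match.group(1) * int(match.group(2)))
        else:
            parts.append(match.group(3))
    return "".join(parts)
```

Applied to the expression above, the result is 4018 characters long:
three 6-base recognition sites separated by spacers of 1000 and 3000 n's.
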
2) The following feature keys are recognized by GETOB, although
not included in the language definition. While they will not
appear in GenBank entries, they could be used in user-created
GenBank-format files:

contig
This feature key is meant to be used to assemble large
sequence segments from smaller segments, possibly using the
poly() operator.

chromosome
Intended to annotate the complete sequence of a chromosome. This
feature may be constructed by a join of two or more contigs.

Use of these keywords is illustrated in the features table
shown below, which could be used to construct a model of part
of the E.coli chromosome, spanning map units 763.4 to 1031.4 kb:

     contig          join(J01619:1..13063,poly("n",7140),
                     J03939:1..1363,poly("n",14380),
                     X02306:complement(1..1622),poly("n",14710),
                     J04423:1..5793,poly("n",22500),
                     X03722:1..2400,poly("n",123750),
                     one-of(X05017:complement(1..1854),X05017:1..1854))
                     /label=Eco_contig8
                     /map=763.4-950.6kb
     contig          join(V00352:1..2412,poly("n",28800),M15273:1..3409)
                     /label=Eco_contig9
                     /map=972.9-1001.7kb
     contig          join(X02826:1..1357,poly("n",13540),
                     J01654:complement(1..2270))
                     /label=Eco_contig10
                     /map=1016.5-1031.4kb
     chromosome      join(Eco_contig8,poly("n",22300),
                     Eco_contig9,poly("n",14800),
                     Eco_contig10)
                     /label=Ecoli_chromosome

NOTES
1) If the constant DEBUG is set to true in the Pascal source code,
getob writes messages to the standard output, indicating the progress
of processing for each entry read in. By default, DEBUG=false.
This feature is solely for debugging purposes and will be removed in
later releases.

2) GETOB automatically expands leading blanks that have been
compressed using splitdb -c. See splitdb.doc for more information.

SEE ALSO
features, splitdb, getloc
The DDBJ/EMBL/GenBank Feature Table: Definition, Version 1.04,
September 1, 1992
GenBank Release Notes for Release 79.0.

AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca

REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.