FEATURES.DOC update 7 Feb 94


NAME
FEATURES - extracts features from GenBank entries

SYNOPSIS
features
features expression
features [-f featurekey | -F keyfile]
         [-n name | -a accession | -e expression |
          -N namefile | -A accfile | -E expfile]
         [-u dbfile | -U dbfile | -g]
features -h

DESCRIPTION
FEATURES extracts sequence objects from GenBank entries, using
the Features Table language. Features can be retrieved either by
specifying keywords (e.g. CDS, mRNA, exon, intron) or by
evaluating expressions. In practical terms, FEATURES is a user
interface for GETOB, which performs the parsing and extraction
of sequence objects. FEATURES can be run either as an interactive
program or with command line arguments.

'features' with no arguments runs the program interactively.
'features' followed by an expression retrieves the data directly
from GenBank and evaluates the expression. The third form of
features requires all arguments to be accompanied by their
respective option flags. Finally, 'features -h' prints the
SYNOPSIS.


INTERACTIVE EXECUTION
FEATURES executed with no arguments runs interactively. An example of the
FEATURES menu is shown below:

___________________________________________________________________
                  FEATURES - Version 7 FEB 94
   Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
___________________________________________________________________
Features: tRNA
Entries:  EPFCPCG
Dataset:
___________________________________________________________________
 Parameter           Description                           Value
-------------------------------------------------------------------
 1).................... FEATURES TO EXTRACT ....................> f
     f:Type a feature at the keyboard
     F:Read a list of features from a file
 2)....................ENTRIES TO BE PROCESSED (choose one).....> n
     Keyboard input - n:name     a:accession #     e:expression
     File input     - N:name(s)  A:accession #(s)  E:expression(s)
 3)....................WHERE TO GET IT .........................> g
     u:Genbank dataset    g:complete GenBank database
     U: same as u, but all entries
 4)....................WHERE TO SEND IT ........................> a
     s:Each feature to a separate file     a:All output to same file
 ---------------------------------------------------------------
 Type number of your choice or 0 to continue:
0
Messages will be written to EPFCPCG.msg
Final sequence output will be written to EPFCPCG.out
Expressions will be written to EPFCPCG.exp
Extracting features...

In the example, FEATURES was instructed to retrieve all tRNAs from
the GenBank entry EPFCPCG, which contains the Epifagus plastid
genome. By default, the GenBank database was the source of the
sequence. Messages indicate the progress of the job. A log describing
the extraction of each feature is written to EPFCPCG.msg, while the
extracted features themselves are written to EPFCPCG.out. Feature
expressions that could be used by FEATURES to reconstruct the .out
file are written to EPFCPCG.exp.

The first step is to retrieve the EPFCPCG entry from GenBank, which is
accomplished by calling FETCH. Next, FEATURES extracts the specified
features from the entry.

An excerpt from EPFCPCG.msg is shown below, describing the extraction
of the fifth tRNA found in this entry. To create this tRNA, two exons
had to be joined. The qualifier lines associated with this feature
indicate that it is a histidine tRNA with a gtg anticodon.


    EPFCPCG:anticodon gtg
    complement
    (
    join
    (
    70023     70028

    1     69

    )

    )


    /product="transfer RNA-His"
    /gene="His-tRNA"
    /label=anticodon gtg
    /note="anticodon gtg"
    //----------------------------------------------


The actual sequence for this feature, as written to EPFCPCG.out, is
shown with each exon beginning a new line:

    >EPFCPCG:anticodon gtg
    ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
    gcgggttcaattcccgtcg
    ttcgcc

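The order of the exons in this output follows from the way the
location complement(join(70023..70028,1..69)) is evaluated: join
concatenates the spans in the order given, and complement then
reverse-complements the whole result, so the 69-base span comes
first. The Python sketch below illustrates that evaluation on a toy
sequence. It is not GETOB's code: the function names and the toy
sequence are hypothetical, and only the three constructs used above
(a..b, join and complement) are handled.

```python
import re

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def revcomp(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def evaluate(location, seq):
    """Evaluate a..b, join(...) and complement(...); positions are 1-based, inclusive."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        return revcomp(evaluate(location[len("complement("):-1], seq))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        return "".join(evaluate(part, seq) for part in split_top_level(inner))
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if not match:
        raise ValueError("unsupported location: " + location)
    start, end = int(match.group(1)), int(match.group(2))
    return seq[start - 1:end]


toy = "aaccggttag"  # hypothetical 10-base sequence standing in for an entry
result = evaluate("complement(join(9..10,1..3))", toy)
print(result)
```

As with the tRNA above, the span listed last in the join (1..3)
appears first in the result, because the complement is applied to the
joined sequence as a whole.
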
Finally, the expression that was evaluated to create this feature is
written to EPFCPCG.exp:

    >EPFCPCG:anticodon gtg
    @M81884:anticodon gtg

If EPFCPCG.exp were used as an expression file in option 2 (E) of FEATURES,
EPFCPCG.out would be recreated.

OPTIONS
1) FEATURES - choosing f will cause FEATURES to prompt for
a feature to extract. If you wish to extract several types of
features simultaneously (i.e. F), you must construct a file listing the
feature keywords. The following example would retrieve both tRNA and
rRNA sequences:

    OBJECTS
    tRNA
    rRNA
    SITES

The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
and each keyword must be on a separate line. For a rigorous
definition of the input file format, see the GETOB manual pages
(getob.doc).

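A keyfile of this shape is easy to generate from a list of feature
keys. The following Python sketch builds the text described above
(OBJECTS, one keyword per line, SITES). The function name is
hypothetical, and the sketch covers only the simple layout shown
here, not the full format defined in getob.doc.

```python
def keyfile_text(feature_keys):
    """Return keyfile contents: OBJECTS, one feature key per line, then SITES."""
    lines = ["OBJECTS"] + list(feature_keys) + ["SITES"]
    return "\n".join(lines) + "\n"


text = keyfile_text(["tRNA", "rRNA"])
print(text, end="")
```

Saved to a file, this text could then be named at the prompt for
suboption F.
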
In the menu shown above, f was chosen, and the user entered tRNA at
the prompt. Thus tRNA is now displayed on the Features: line. If
features had been specified from a file (suboption F), then the
name of the file containing the feature keywords would be displayed
instead. A complete list of legal feature keywords can be found in the
GenBank Release notes (gbrel.txt) under the subheading 'Feature Key
Names'.

2) ENTRIES
   n  User is prompted for the name of an entry from which the
      feature is to be extracted. The name of the entry will appear
      on the 'Entries' line of the menu.

   N  User is prompted for a filename containing one or more
      entry names. Each name must be on a separate line. The filename
      will be displayed on the 'Entries' line of the menu.

   a  User is prompted for an accession number, which will appear
      on the 'Entries' line of the menu.

   A  User is prompted for the name of a file containing accession
      numbers. The filename will appear on the 'Entries' line.

   e  User is prompted for a GenBank Features expression of the
      form accession:location. 'accession' refers to a GenBank
      accession number, while 'location' is any legal feature location.
      A brief description of location syntax can be found under the
      subheading "Feature Location" in the GenBank release notes
      (gbrel.txt). See "The DDBJ/EMBL/GenBank Feature Table:
      Definition" Version 1.04 for a complete definition.

   E  User is prompted for a filename containing one or more Feature
      expressions. EACH EXPRESSION MUST BEGIN WITH '@'. All lines
      beginning with '@' are processed as expressions, and all other
      lines are copied to the output file unchanged.

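The rule for expression files (suboption E) can be pictured as a
simple line filter. The Python sketch below only classifies lines and
does not evaluate anything. The function name and the 'expression'
and 'copy' tags are hypothetical. The sample lines are the contents
of EPFCPCG.exp shown earlier.

```python
def split_expression_file(lines):
    """Yield (kind, text): 'expression' for lines starting with '@', 'copy' otherwise."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            # Drop the leading '@'; the rest is the expression to evaluate.
            yield ("expression", line[1:])
        else:
            # Every other line is passed through to the output unchanged.
            yield ("copy", line)


sample = [">EPFCPCG:anticodon gtg\n", "@M81884:anticodon gtg\n"]
parsed = list(split_expression_file(sample))
for kind, text in parsed:
    print(kind, text)
```

This is why the '>' name line in an .exp file reappears in the output
unchanged, while the '@' line is replaced by the sequence it
describes.
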
Examples:

The tRNA shown above could have been extracted by choosing
suboption e and entering either of the following expressions:

    M81884:complement(join(70023..70028,1..69))
    M81884:anticodon gtg

In the first example, the feature line from the original entry
is used as the location. In the second example, the feature is
found by its qualifier line, which also appeared in the
original entry. Note that the qualifier line must be unique
within the entry in its first 15 characters after the '='.

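Before relying on a qualifier to identify a feature, it can be worth
checking the 15-character rule against the other qualifiers in the
entry. The Python sketch below does this under one assumption that
the text does not settle: the 15 characters are counted from the
character immediately after the '=', so an opening quote counts. The
function name and the sample qualifiers are hypothetical.

```python
from collections import Counter


def ambiguous_prefixes(qualifiers, width=15):
    """Return the sorted prefixes (first `width` characters after '=') that occur more than once."""
    prefixes = []
    for qualifier in qualifiers:
        _, _, value = qualifier.partition("=")
        prefixes.append(value[:width])
    counts = Counter(prefixes)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


qualifiers = [
    '/note="ribosomal protein S12 exon 1"',  # hypothetical qualifier lines
    '/note="ribosomal protein S12 exon 2"',
    '/gene="His-tRNA"',
]
clashes = ambiguous_prefixes(qualifiers)
print(clashes)
```

Here the two /note= qualifiers differ only after their first 15
characters, so under the stated assumption neither could be used on
its own to pick out a feature.
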
The flaL protein coding region of B. licheniformis is described
in GenBank entry BLIFALA, accession number M60287, in the
following feature:

    CDS             305..640
                    /note="flaD (sin) homologue"
                    /gene="flaL"
                    /label=ORF2
                    /codon_start=1

This feature could be retrieved using any of the following
expressions:

    M60287:305..640
    M60287:ORF2
    M60287:/label=ORF2
    M60287:/gene="flaL"
    M60287:/note="flaD (sin) homologue"

Note that the /label= qualifier is special, in that labels are
specifically intended as unique tags on a feature. For labels,
only the label itself need be specified. Thus, /label=ORF2 is
equivalent to ORF2. For other qualifiers, the qualifier keyword
(e.g. /note=) must be included.

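The forms an expression can take may be summarized in a few lines of
Python. This is a simplified sketch, not the parser used by GETOB:
the function name and the three category names are hypothetical, and
only plain a..b ranges, join(...) and complement(...) are recognized
as locations.

```python
import re


def parse_expression(expression):
    """Split accession:rest and classify rest as 'location', 'qualifier' or 'label'."""
    accession, sep, rest = expression.partition(":")
    if not sep or not accession or not rest:
        raise ValueError("expected accession:location, got " + repr(expression))
    if rest.startswith("/label="):
        # /label=ORF2 is equivalent to the bare label ORF2.
        return accession, "label", rest[len("/label="):]
    if rest.startswith("/"):
        # Any other qualifier keeps its keyword, e.g. /gene="flaL".
        return accession, "qualifier", rest
    if re.match(r"(\d+\.\.\d+|join\(|complement\()", rest):
        return accession, "location", rest
    # Anything else is treated as a bare label.
    return accession, "label", rest


for text in ["M60287:305..640", "M60287:ORF2", "M60287:/label=ORF2",
             'M60287:/gene="flaL"']:
    print(parse_expression(text))
```

Treating /label=ORF2 and ORF2 identically mirrors the equivalence
described above. All other qualifiers are kept whole, keyword
included.
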
3) DATABASE (WHERE TO GET IT) - By default, all entries processed will
be automatically retrieved from GenBank using FETCH. Specifying 'u'
(User-defined database subset) makes it possible to extract features
from GenBank subsets created by the user. Usually, retrieval of
features is much faster with a User-defined subset, so if you
frequently work with sets of genes, it is best to retrieve them
en masse using FETCH, and work with them directly. For example, if
you had retrieved a set of beta-globin sequences into a file called
'globin.gen', you could directly extract features from these entries
by specifying 'globin' or 'globin.gen' as your User-defined database.
If the file extension is '.gen', FEATURES will automatically create
temporary files called globin.ano, globin.wrp and globin.ind,
containing annotation, sequence, and an index, respectively. These
files will be read during feature extraction, and then discarded. If
you have already created such files using SPLITDB, simply specify
any of 'globin', 'globin.ano', etc., i.e. any name that does
not have the .gen file extension.

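The naming convention for User-defined datasets can be sketched as
follows. This Python fragment only derives file names and touches no
files. The function name is hypothetical, and it assumes that
whatever extension is given is simply removed to obtain the base
name, which the text implies but does not state for extensions other
than those listed.

```python
import os


def dataset_files(dbfile):
    """Return (split_needed, [annotation, sequence, index]) for a dataset name.

    split_needed is True only for a '.gen' file, in which case the three
    files are temporary and are created from it before extraction.
    """
    base, ext = os.path.splitext(dbfile)
    split_needed = ext == ".gen"
    files = [base + suffix for suffix in (".ano", ".wrp", ".ind")]
    return split_needed, files


print(dataset_files("globin.gen"))
print(dataset_files("globin"))
```

In the first call the three files would be temporary. In the second
they are expected to exist already, for instance because they were
created with SPLITDB.
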
'U' rather than 'u' causes ALL entries in the user-defined
database to be processed. This means that it is unnecessary to
specify entry options (e.g. -n, -N), as these will be
ignored if given.

One consequence of these conventions is that the individual GenBank
divisions can be processed directly. For example, suppose you were only
interested in rodent globins. You could directly access the rodent
division of GenBank by specifying the base name of that division's files
(e.g. /home/psgendb/GenBank/gbrod) as your user-defined database. In
this case, the files gbrod.ano, gbrod.wrp and gbrod.ind already
exist. Again, this approach is faster, since FEATURES does not have
to find and retrieve the sequences, but can read directly from the
database files. Finally, if you wanted to process all of the entries
in the database division, simply use -U. The user is warned that a
GenBank division is a huge amount of data, and processing every entry
could take a long time.

4) WHERE TO SEND IT - By default (a), the output for all entries goes
to a single set of files, whose names are chosen by FEATURES,
depending on the setting of option 2, Entries. If a single name (n) or
accession number (a) has been chosen, that will be used as
the raw filename. For example, if you were processing the entry
WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names
(N), accession numbers (A) or expressions (E) were read from a file,
the raw name of that file would be used, e.g. cellulase.nam would result
in cellulase.msg and cellulase.out. Finally, if a single expression
is processed (e), then the primary accession number in that
expression will be used for the filenames. In all cases, FEATURES
will tell you the names of the files being written.

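The naming rules for the default (a) setting can be written out as a
small Python function. This is a sketch of the rules as described
here, not of the program itself. The function name is hypothetical,
and it assumes that the "raw name" of an input file is the name with
its extension removed, and that the .exp file shown in the
interactive example is named in the same way.

```python
import os


def output_names(mode, value):
    """Return the .msg, .out and .exp names for an Entries suboption and its value."""
    if mode in ("n", "a"):
        raw = value                         # entry name or accession number
    elif mode == "e":
        raw = value.split(":", 1)[0]        # accession part of accession:location
    elif mode in ("N", "A", "E"):
        raw = os.path.splitext(value)[0]    # input file name without its extension
    else:
        raise ValueError("unknown Entries suboption: " + repr(mode))
    return [raw + ext for ext in (".msg", ".out", ".exp")]


print(output_names("n", "WHTCAB"))
print(output_names("N", "cellulase.nam"))
print(output_names("e", "M60287:LORF-"))
```
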
By choosing suboption s, you can specify that the features created for
each entry be sent to separate files. In this case, each file will
have the name of that entry, with the extension .obj. However, all
messages and expressions will still go to a single set of files. While
this can be a convenient way of creating separate files when you need
them, this option still has the limitation of writing all features for
a given entry (if there is more than one) to the same file. Also,
successive resolution of features (anything requiring 'getob -r')
will not work with this option. This may be corrected in future
versions.


COMMAND LINE EXECUTION

There are two ways of running FEATURES from the command line. If only one
argument is supplied, that argument is interpreted as an expression, and
the result of that expression (i.e. a sequence) is written to the
standard output. The .msg, .out and .exp files are NOT created. For example,
GenBank entry BACFLALA (M60287) contains the following feature:

    CDS             95..271
                    /label=LORF-
                    /codon_start=1
                    /translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
                    PSASIERSHSINPF"

Any of

    features M60287:LORF-
    features M60287:95..271
    features M60287:/label=LORF-

would write the open reading frame to the standard output:

    atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
    actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
    tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
    agaagtcattcaataaatcctttctga

This form of FEATURES is provided to make it easy to pipe output to
other programs for further processing. For example,

    features M60287:LORF- | ribosome > LORF.protein

would write the translation of the open reading frame to a file called
LORF.protein.

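As a cross-check that does not depend on either program, the open
reading frame printed above can be translated with the standard
genetic code and compared with the /translation qualifier of the
feature. The Python sketch below does this. It is not a substitute
for ribosome, whose options and output format are not described here,
and the names used are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino
               for codon, amino in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    dna = dna.lower()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# The sequence written by 'features M60287:LORF-' in the example above.
orf = (
    "atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga"
    "actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta"
    "tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa"
    "agaagtcattcaataaatcctttctga"
)
protein = translate(orf)
print(protein)
```

The result agrees with the /translation qualifier shown for the
feature, which confirms that 95..271 with /codon_start=1 is read in
frame from its first base.
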
The full functionality of FEATURES can be accessed using arguments on
the command line. In particular, when there are multiple entries to be
processed, or multiple features within entries, it is much faster to
supply FEATURES with lists of entries, feature keys or expressions.
Command line options are similar to suboptions in menu items 1-3 above:

Feature keys:
    -f key          {feature key}
    -F filename     {file of feature keys}

Entries:
    -n name         {GenBank LOCUS name}
    -N filename     {file of GenBank LOCUS names}
    -a accession    {GenBank ACCESSION number}
    -A filename     {file of GenBank ACCESSION numbers}
    -e expression   {Feature Table expression}
    -E filename     {file of Feature Table expressions, each
                     beginning with '@'}

Databases:
    -u filename     {GenBank dataset}
    -U filename     {GenBank dataset, but process all entries,
                     i.e. -nNaAeE options will be ignored}
    -g              {GenBank}

Examples:

    features -f tRNA -n EPFCPCG

retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg, .out,
and .exp files.

    features -e M60287:LORF-

would retrieve the same open reading frame as in the earlier example.


Since the most time-consuming operation in FEATURES is sequence retrieval,
it is often best to retrieve frequently-used sequences as database
subsets. For example, a set of GenBank entries for chlorophyll a/b binding
protein genes might be stored in a file called CAB.gen.

    features -f CDS -N CAB.nam -u CAB.gen

would generate the files CAB.msg, CAB.out and CAB.exp containing output
for all CDS features in the entries listed in the file CAB.nam.

    features -E CAB.exp -u CAB.gen

would re-create the output file CAB.out.

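When FEATURES is driven from a script, it can help to assemble the
argument list in one place. The Python sketch below builds argument
lists for the two CAB commands above, using only flags listed in the
SYNOPSIS. The function name and its parameter names are hypothetical,
and the sketch does not run the program. Passing the list to
subprocess.run would do so on a system where FEATURES is installed.

```python
def features_argv(feature_key=None, names_file=None,
                  expressions_file=None, dataset=None):
    """Build an argument list for 'features' from the given settings."""
    argv = ["features"]
    if feature_key is not None:
        argv += ["-f", feature_key]        # single feature key
    if names_file is not None:
        argv += ["-N", names_file]         # file of LOCUS names
    if expressions_file is not None:
        argv += ["-E", expressions_file]   # file of '@' expressions
    if dataset is not None:
        argv += ["-u", dataset]            # user-defined GenBank dataset
    return argv


extract = features_argv(feature_key="CDS", names_file="CAB.nam",
                        dataset="CAB.gen")
rebuild = features_argv(expressions_file="CAB.exp", dataset="CAB.gen")
print(" ".join(extract))
print(" ".join(rebuild))
```
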
BUGS
FEATURES does no preliminary error checking for syntax of
GenBank expressions prior to their evaluation. Expressions that
cannot be evaluated will be flagged by GETOB in the .msg file.

At present, little checking is done to test for the presence or
correctness of input files. Some errors may cause the program to
crash.

For User-defined datasets, filename expansion is not performed.

FILES
Temporary files:
    X.term X.ano X.wrp X.ind X.gen  {X is raw filename, see option 4}
    UNRESOLVED.fea UNRESOLVED.out
    FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind FEA.msg FEA.out

SEE ALSO
grep(1V) fetch getob splitdb

TRANSPORTATION NOTES
It should be fairly easy to get FEATURES to work even on systems
in which GenBank has not been formatted for the XYLEM package.
This is because FEATURES does not work directly on the database, but
rather retrieves all necessary sequences by calling FETCH. Thus,
statements like 'fetch FEA.nam FEA.gen' could be replaced with any
command that, given a file containing names or accession numbers,
returns a file containing GenBank entries. In principle, you
could even implement this sort of command to retrieve entries from
the email server (retrieve@ncbi.nlm.nih.gov) at NCBI, although
such a setup would undoubtedly be quite slow.

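As an illustration of what such a replacement command has to do, the
Python sketch below selects entries by LOCUS name from GenBank flat
file text held in memory. It relies only on two facts about the flat
file format: each entry begins with a LOCUS line and ends with a line
beginning '//'. Everything else is an assumption made for the sketch.
The function name and the toy entries are hypothetical, lookup by
accession number is not handled, and reading FEA.nam and writing
FEA.gen are left out.

```python
def fetch_entries(names, flatfile_text):
    """Return the entries whose LOCUS name is in names, in file order."""
    wanted = set(names)
    kept, current, keep = [], [], False
    for line in flatfile_text.splitlines(keepends=True):
        if line.startswith("LOCUS"):
            fields = line.split()
            keep = len(fields) > 1 and fields[1] in wanted
            current = []            # start collecting a new entry
        current.append(line)
        if line.startswith("//"):   # end of the current entry
            if keep:
                kept.extend(current)
            current, keep = [], False
    return "".join(kept)


# Two toy entries, heavily abbreviated.
flat = (
    "LOCUS       AAA   10 bp\nORIGIN\n        1 aaccggttag\n//\n"
    "LOCUS       BBB   4 bp\nORIGIN\n        1 acgt\n//\n"
)
print(fetch_entries(["BBB"], flat), end="")
```
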
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca

REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.