FEATURES.DOC update 7 Feb 94
NAME
FEATURES - extracts features from GenBank entries
SYNOPSIS
features
features expression
features [-f featurekey | -F keyfile]
[-n name |-a accession | -e expression |
-N namefile |-A accfile | -E expfile]
[-u dbfile | -U dbfile | -g ]
features -h
DESCRIPTION
FEATURES extracts sequence objects from GenBank entries, using
the Features Table language. Features can be retrieved either by
specifying keywords (eg. CDS, mRNA, exon, intron etc.) or by
evaluating expressions. In practical terms, FEATURES is a user
interface for GETOB, which performs the actual parsing and
extraction of sequence objects. FEATURES can be run either as
an interactive program or with command line arguments.
'features' with no arguments runs the program interactively.
'features' followed by an expression retrieves the data directly
from GenBank and evaluates the expression. The third form of
features requires all arguments to be accompanied by their
respective option flags. Finally, 'features -h' prints the
SYNOPSIS.
INTERACTIVE EXECUTION
FEATURES executed with no arguments runs interactively. An example of the
FEATURES menu is shown below:
___________________________________________________________________
FEATURES - Version 7 FEB 94
Please cite: Fristensky (1993) Nucl. Acids Res. 21:5997-6003
___________________________________________________________________
Features: tRNA
Entries: EPFCPCG
Dataset:
___________________________________________________________________
Parameter Description Value
-------------------------------------------------------------------
1).................... FEATURES TO EXTRACT ....................> f
f:Type a feature at the keyboard
F:Read a list of features from a file
2)....................ENTRIES TO BE PROCESSED (choose one).....> n
Keyboard input - n:name a:accession # e:expression
File input - N:name(s) A:accession #(s) E:expression(s)
3)....................WHERE TO GET IT .........................> g
u:Genbank dataset g:complete GenBank database
U: same as u, but all entries
4)....................WHERE TO SEND IT ........................> a
s:Each feature to a separate file a:All output to same file
---------------------------------------------------------------
Type number of your choice or 0 to continue:
0
Messages will be written to EPFCPCG.msg
Final sequence output will be written to EPFCPCG.out
Expressions will be written to EPFCPCG.exp
Extracting features...
In the example, FEATURES was instructed to retrieve all tRNAs from
the GenBank entry EPFCPCG, which contains the Epifagus plastid
genome. By default, the GenBank database was the source of the
sequence. Messages indicate the progress of the job. A log describing
the extraction of each feature is written to EPFCPCG.msg, while the
extracted features themselves are written to EPFCPCG.out. Feature
expressions that FEATURES could use to reconstruct the .out file
are written to EPFCPCG.exp.
The first step is to retrieve the EPFCPCG entry from GenBank, which is
accomplished by calling FETCH. Next, FEATURES extracts the specified
features from the entry.
An excerpt from EPFCPCG.msg is shown below, describing the extraction
of the fifth tRNA found in this entry. To create this tRNA, two exons
had to be joined. The qualifier lines associated with this feature
indicate that it is a Histidine tRNA with a gtg anticodon.
EPFCPCG:anticodon gtg
complement
(
join
(
70023 70028
1 69
)
)
/product="transfer RNA-His"
/gene="His-tRNA"
/label=anticodon gtg
/note="anticodon gtg"
//----------------------------------------------
The actual sequence for this feature appears in EPFCPCG.out with
each exon beginning a new line:
>EPFCPCG:anticodon gtg
ggcggatgtagccaaatggatcaaggtagtggattgtgaatccaacatat
gcgggttcaattcccgtcg
ttcgcc
Finally, the expression that was evaluated to create this feature is
written to EPFCPCG.exp:
>EPFCPCG:anticodon gtg
@M81884:anticodon gtg
If EPFCPCG.exp were used as an expression file in option 2 (E) of
FEATURES, EPFCPCG.out would be recreated.
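The location in the .msg excerpt above, complement(join(70023..70028,1..69)),
can be read as nested operations on the entry's sequence. The following
Python sketch illustrates that reading only; it is not GETOB's
implementation. It handles just the three constructs seen in this example
(a..b ranges, join and complement), and the function names are hypothetical.

```python
import re

# Complement table for unambiguous nucleotides only (an assumption of this sketch).
_COMP = str.maketrans("acgtACGT", "tgcaTGCA")


def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def evaluate(location, seq):
    """Return the part of seq described by a (simplified) location string."""
    location = location.strip()
    m = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if m:
        first, last = int(m.group(1)), int(m.group(2))
        return seq[first - 1:last]  # locations are 1-based and inclusive
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        # Concatenate the pieces in the order they are listed.
        return "".join(evaluate(p, seq) for p in _split_top_level(inner))
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        # Reverse complement of whatever the inner location yields.
        return evaluate(inner, seq).translate(_COMP)[::-1]
    raise ValueError("unsupported location: " + location)
```

With a toy 10-base sequence "aaccggttag", evaluate("join(9..10,1..2)", seq)
joins the last two bases to the first two, which is the same wrap-around
pattern as 70023..70028 followed by 1..69 in the EPFCPCG example.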
OPTIONS
1) FEATURES - choosing f will cause FEATURES to prompt for
a feature to extract. If you wish to extract several types of
features simultaneously (ie. F), you must construct a file listing the
feature keywords. The following example would retrieve both tRNA and
rRNA sequences:
OBJECTS
tRNA
rRNA
SITES
The words 'OBJECTS' and 'SITES' must enclose the feature keywords,
and each keyword must be on a separate line. For a rigorous
definition of the input file format, see the GETOB manual pages
(getob.doc).
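As an illustration, a key file of this shape could be generated by a
short Python helper. This is a minimal sketch based only on the example
above; getob.doc remains the authoritative definition of the format,
and the function name is hypothetical.

```python
def keyfile_text(keys):
    """Build key file contents: OBJECTS, one keyword per line, then SITES."""
    lines = ["OBJECTS"] + list(keys) + ["SITES"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the four-line tRNA/rRNA example shown above.
    print(keyfile_text(["tRNA", "rRNA"]), end="")
```

The text could then be saved to a file and given to suboption F.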
In the menu shown above, f was chosen, and the user entered tRNA at
the prompt. Thus tRNA is now displayed on the Features: line. If
features had been specified from a file (suboption F) then the
filename containing the feature keywords would be displayed instead.
A complete list of legal feature keywords can be found in the GenBank
Release notes (gbrel.txt) under the subheading 'Feature Key Names'.
2) ENTRIES
n User is prompted for the name of an entry from which the
feature is to be extracted. The name of the entry will appear
on the 'Entries' line of the menu.
N User is prompted for a filename containing one or more
entry names. Each name must be on a separate line. The filename
will be displayed on the 'Entries' menu line.
a User is prompted for an accession number, which will appear
on the 'Entries' line of the menu.
A User is prompted for a filename for accession numbers. The filename
will appear on the 'Entries:' line.
e User is prompted for a GenBank Features expression of the
form accession:location. 'accession' refers to a GenBank
accession number, while 'location' is any legal feature location.
A brief description of location syntax can be found under the
subheading "Feature Location" in the GenBank release notes
(gbrel.txt). See "The DDBJ/EMBL/GenBank Feature Table:
Definition" Version 1.04 for a complete definition.
E User is prompted for a filename containing one or more Feature
expressions. EACH EXPRESSION MUST BEGIN WITH A '@'. All lines beginning
with '@' are processed as expressions, and all other lines are
copied to the output file unchanged.
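The '@' rule for expression files (suboption E) can be pictured with
the following Python sketch. It only separates the two kinds of lines;
it does not evaluate anything, and the function name is hypothetical.

```python
def split_expression_file(lines):
    """Separate '@' expression lines from lines that are copied unchanged."""
    expressions, passthrough = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            expressions.append(line[1:])  # drop the leading '@'
        else:
            passthrough.append(line)
    return expressions, passthrough
```

Applied to the two lines of EPFCPCG.exp shown earlier, the '>' title
line falls into the pass-through group and '@M81884:anticodon gtg'
yields the expression 'M81884:anticodon gtg'.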
Examples:
The tRNA shown above could have been extracted by choosing
suboption e and entering either of the following expressions:
M81884:complement(join(70023..70028,1..69))
M81884:anticodon gtg
In the first example, the feature line from the original entry
is used as the location. In the second example, the feature is
found by its qualifier line, which also appeared in the
original entry. Note that the first 15 characters after the '='
of the qualifier line must distinguish it from every other
qualifier line in the same entry.
The flaL protein coding region of B. licheniformis is described
in GenBank entry BLIFALA, accession number M60287 in the
following feature:
CDS 305..640
/note="flaD (sin) homologue"
/gene="flaL"
/label=ORF2
/codon_start=1
This feature could be retrieved using any of the following
expressions:
M60287:305..640
M60287:ORF2
M60287:/label=ORF2
M60287:/gene="flaL"
M60287:/note="flaD (sin) homologue"
Note that the /label= qualifier is special, in that labels are
specifically intended as unique tags on a feature. For labels,
only the label itself need be specified. Thus, /label=ORF2 is
equivalent to ORF2. For other qualifiers, the qualifier keyword
(eg. /note=) must be included.
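The expression forms listed above can be told apart by what follows
the colon. The Python sketch below shows one rough way to do so. The
three category names and the rules used to distinguish them are
assumptions of this sketch, not GETOB's own logic, and only the
location operators that appear in this document are recognized.

```python
def split_expression(expression):
    """Split 'accession:rest' and guess what kind of reference 'rest' is."""
    accession, sep, rest = expression.partition(":")
    if not sep or not accession or not rest:
        raise ValueError("expected accession:location, got " + repr(expression))
    if rest.startswith("/"):
        kind = "qualifier"   # eg. /gene="flaL"
    elif rest[0].isdigit() or rest.startswith(("join(", "complement(")):
        kind = "location"    # eg. 305..640
    else:
        kind = "label"       # bare label, equivalent to /label=...
    return accession, kind, rest
```

For example, split_expression("M60287:ORF2") reports a label, while
split_expression("M60287:305..640") reports a location.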
3) DATABASE (WHERE TO GET IT) - By default, all entries processed will
be automatically retrieved from GenBank using FETCH. Specifying 'u'
(User-defined database subset) makes it possible to extract features
from GenBank subsets created by the user. Usually, retrieval of
features is much faster with a User-defined subset, so if you
frequently work with sets of genes, it is best to retrieve them
en-masse using FETCH, and work with them directly. For example, if
you had retrieved a set of Beta-globin sequences into a file called
'globin.gen', you could directly extract features from these entries
by specifying 'globin' or 'globin.gen' as your User-defined database.
If the file extension is '.gen', FEATURES will automatically create
temporary files called globin.ano, globin.wrp and globin.ind,
containing annotation, sequence, and an index, respectively. These
files will be read during feature extraction, and then discarded. If
you have already created such files using SPLITDB, simply specify
'globin', 'globin.ano', or any other name for the dataset that does
not have the .gen file extension.
'U' rather than 'u' causes ALL entries in the user-defined
database to be processed. This means that it is unnecessary to
specify entry options (eg. -n, -N etc.); if given, they will be
ignored.
One consequence of these conventions is that the individual GenBank
divisions can be processed directly. For example, suppose you were only
interested in rodent globins. You could directly access the rodent
division of GenBank by specifying the base name of that file division
(eg. /home/psgendb/GenBank/gbrod) as your user-defined database. In
this case, the files gbrod.ano, gbrod.wrp and gbrod.ind already
exist. Again, this approach is faster, since FEATURES would not have
to find and retrieve the sequences, but can read directly from the
database files. Finally, if you wanted to process all of the entries
in the database division, simply use -U. The user is warned that a
GenBank division is a huge amount of data, and processing every entry
could take a long time.
4) WHERE TO SEND IT - By default (a), the output for all entries goes
to a single set of files, whose names are chosen by FEATURES,
depending on the setting of option 2, Entries. If a single name (n) or
accession number (a) has been chosen, that will be used as
the raw filename. For example, if you were processing the entry
WHTCAB, the output files would be WHTCAB.msg and WHTCAB.out. If names
(N), accession numbers (A) or expressions (E) were read from a file,
the raw name of that file would be used eg. cellulase.nam would result
in cellulase.msg and cellulase.out. Finally, if a single expression
is processed (e), then the primary accession number in that
expression will be used for the filenames. In all cases, FEATURES
will tell you the names of the files being written.
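The naming rules just described can be summarized in a short Python
sketch. It assumes that the 'raw' filename is simply the name with its
final extension removed, and it includes the .exp file mentioned in
the interactive example; the function names are hypothetical.

```python
import os


def output_base(mode, value):
    """Return the raw filename FEATURES would be expected to use."""
    if mode in ("n", "a"):
        return value                        # single name or accession number
    if mode in ("N", "A", "E"):
        return os.path.splitext(value)[0]   # input filename without extension
    if mode == "e":
        return value.partition(":")[0]      # accession part of the expression
    raise ValueError("unknown entry option: " + repr(mode))


def output_files(mode, value):
    """List the message, sequence and expression files for one run."""
    base = output_base(mode, value)
    return [base + ext for ext in (".msg", ".out", ".exp")]
```

For instance, output_files("N", "cellulase.nam") begins with
cellulase.msg and cellulase.out, matching the example above.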
Choosing suboption s, you can specify that the features created for
each entry be sent to separate files. In this case, each file will
have the name of that entry, with the extension .obj. However, all
messages and expressions will still go to a single set of files. While this
can be a convenient way of creating separate files when you need them,
this option still has the limitation of writing all features for a
given entry (if there are more than one) to the same file. Also,
successive resolution of features (anything requiring 'getob -r')
will not work with this option. This may be corrected in future
versions.
COMMAND LINE EXECUTION
There are two ways of running FEATURES from the command line. If only one
argument is supplied, that argument is interpreted as an expression, and
the result of that expression (ie. a sequence ) is written to the
standard output. .msg, .out and .exp files are NOT created. For example,
GenBank entry BACFLALA (M60287) contains the following feature:
CDS 95..271
/label=LORF-
/codon_start=1
/translation="MNKDKNEKEELDEEWTELIKHALEQGISPDDIRIFLNLGKKSSK
PSASIERSHSINPF"
Any of
features M60287:LORF-
features M60287:95..271
features M60287:/label=LORF-
would write the open reading frame to the standard output:
atgaataaagataaaaatgagaaagaagaattggatgaggagtggacaga
actgattaaacacgctcttgaacaaggcattagtccagacgatatacgta
tttttctcaatttgggtaagaagtcttcaaaaccttccgcatcaattgaa
agaagtcattcaataaatcctttctga
This form of FEATURES is provided to make it easy to pipe output to
other programs for further processing. For example
features M60287:LORF- |ribosome >LORF.protein
would write the translation of the open reading frame to a file called
LORF.protein.
The full functionality of FEATURES can be accessed using arguments on
the command line. In particular, when there are multiple entries to be
processed, or multiple features within entries, it is much faster to
supply FEATURES with lists of entries, feature keys or expressions.
Command line options are similar to suboptions in menu items 1-3 above:
Feature keys:
-f key {feature key}
-F filename {file of feature keys}
Entries:
-n name {GenBank LOCUS name}
-N filename {file of GenBank LOCUS names}
-a accession {GenBank ACCESSION number}
-A filename {file of GenBank ACCESSION numbers}
-e expression {Feature Table expression}
-E filename {file of Feature Table expressions, each
beginning with '@'}
Databases:
-u filename {GenBank dataset}
-U filename {GenBank dataset; process all entries, ie.
-nNaAeE options will be ignored}
-g {GenBank}
Examples:
features -f tRNA -n EPFCPCG
retrieves all tRNAs from GenBank entry EPFCPCG and writes .msg, .out,
and .exp files.
features -e M60287:LORF-
would retrieve the same open reading frame as in the earlier example.
Since the most time-consuming operation in FEATURES is sequence retrieval,
it is often best to retrieve frequently-used sequences as database
subsets. For example, a set of GenBank entries for chlorophyll a/b binding
protein genes might be stored in a file called CAB.gen.
features -f CDS -N CAB.nam -u CAB.gen
would generate the files CAB.msg, CAB.out and CAB.exp containing output
for all CDS features in the entries listed in the file CAB.nam.
features -E CAB.exp -u CAB.gen
would re-create the output file CAB.out.
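When FEATURES is driven from a script, the argument list can be
assembled programmatically. The Python sketch below builds such a list
from the option flags in the table above, allowing at most one flag
from each group. It only constructs the arguments and does not run the
program; the function name and its (flag, value) convention are
hypothetical.

```python
FEATURE_FLAGS = {"f", "F"}
ENTRY_FLAGS = {"n", "N", "a", "A", "e", "E"}
DATABASE_FLAGS = {"u", "U", "g"}


def features_argv(feature=None, entry=None, database=None):
    """Build an argument list; each argument is a (flag, value) pair or None."""
    argv = ["features"]
    groups = (
        (feature, FEATURE_FLAGS),
        (entry, ENTRY_FLAGS),
        (database, DATABASE_FLAGS),
    )
    for group, allowed in groups:
        if group is None:
            continue
        flag, value = group
        if flag not in allowed:
            raise ValueError("unexpected flag: -" + flag)
        argv.append("-" + flag)
        if flag != "g":          # -g takes no filename
            argv.append(value)
    return argv
```

On a system where FEATURES is installed, the resulting list could be
passed to Python's subprocess.run.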
BUGS
FEATURES does no preliminary error checking for syntax of
GenBank expressions prior to their evaluation. Expressions that
cannot be evaluated will be flagged by GETOB in the .msg file.
At present, little checking is done to test for the presence or
correctness of input files. Some errors may cause the program to
crash.
For User-defined datasets, filename expansion is not performed.
FILES
Temporary files:
X.term X.ano X.wrp X.ind X.gen {X is raw filename, see 4) }
UNRESOLVED.fea UNRESOLVED.out
FEA.inf FEA.nam FEA.gen FEA.ano FEA.wrp FEA.ind FEA.msg FEA.out
SEE ALSO
grep(1V) fetch getob splitdb
TRANSPORTATION NOTES
It should be fairly easy to get FEATURES to work even on systems
in which GenBank has not been formatted for the XYLEM package.
This is because FEATURES does not work directly on the database, but
rather retrieves all necessary sequences by calling FETCH. Thus,
statements like 'fetch FEA.nam FEA.gen' could be replaced with any
command that, given a file containing names or accession numbers,
returns a file containing GenBank entries. In principle, you
could even implement this sort of command to retrieve entries from
the email server (retrieve@ncbi.nlm.nih.gov) at NCBI, although
such a setup would undoubtedly be quite slow.
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.