IDENTIFY update 3 Feb 94
NAME
identify - creates a file of locus names corresponding to lines
found by grep in a GenBank annotation file.
SYNOPSIS
identify grepfile indfile namefile findfile
DESCRIPTION
grepfile is created using the Unix grep command to search a .ano
file created by splitgb. For example, to find all lines containing
the word 'chlorophyll' in plant.ano, use
    grep -n -i 'chlorophyll' plant.ano > plant.grep
In the example shown, the -n option causes each line written to
plant.grep to be preceded by the number of that line in plant.ano.
(The -i option causes grep to ignore case.) identify uses indfile
to determine which entry each numbered line was found in, and
writes the corresponding LOCUS name to namefile. In addition, all
lines found in a given entry are rewritten to findfile without the
line numbers, and preceded by the LOCUS name for that entry.
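
The mapping from numbered lines to entries can be illustrated with a
short awk sketch. This is not the identify program itself. It assumes a
hypothetical, simplified index in which each line holds a LOCUS name
and the line number at which that entry starts in the .ano file; the
real indfile layout may differ. It also assumes that the LOCUS name is
prefixed to each line of findfile. All file names and LOCUS names below
are made up for the illustration.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Hypothetical, simplified index: LOCUS name and the line of the .ano
# file on which that entry starts. The real indfile layout may differ.
cat > demo.ind <<'EOF'
ATHCAB1 1
MZECAB2 40
PEACAB3 95
EOF

# Lines in the form written by grep -n: line number, colon, matched text.
cat > demo.grep <<'EOF'
12:DEFINITION  A.thaliana chlorophyll a/b binding protein gene.
47:DEFINITION  Maize chlorophyll a/b binding protein mRNA.
63:KEYWORDS    chlorophyll a/b binding protein.
EOF

awk -v namefile=demo.nam -v findfile=demo.fnd '
    # First file: remember each entry name and its starting line.
    NR == FNR { name[++n] = $1; start[n] = $2 + 0; next }
    # Second file: split "number:text" at the first colon.
    {
        colon = index($0, ":")
        num   = substr($0, 1, colon - 1) + 0
        text  = substr($0, colon + 1)
        # Advance to the last entry that starts at or before this line
        # (grep -n output is already in ascending line order).
        while (i < n && start[i + 1] <= num) i++
        if (i == 0) next            # line precedes the first entry
        # Write each entry name once, the first time it is hit.
        if (i != last) { print name[i] > namefile; last = i }
        print name[i], text > findfile
    }
' demo.ind demo.grep

cat demo.nam
cat demo.fnd
```

In this sketch lines 47 and 63 both fall inside the entry that starts
at line 40, so that name is written to demo.nam only once while both
matched lines appear in demo.fnd.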
EXAMPLES
Suppose you wanted to obtain a list of names for all plant
sequences which code for proteins. The task is complicated by the
fact that many fungal sequences are included in the GenBank plant
file. You could begin by searching plant.ano (containing all
GenBank plant entries) for the word 'Planta':
    grep -n 'Planta' plant.ano > Planta.grep
However, we want to eliminate all fungal sequences, as well as all
sequences for RNAs other than mRNAs. If we create the file
bad.str containing the keywords
    Mycophyta
    tRNA
    rRNA
    uRNA
we can then type
    grep -n -f bad.str plant.ano > bad.grep
bad.grep now contains all lines containing the offending keywords.
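
As a small illustration of the pattern-file search, the sketch below
runs the same grep options on a made-up six-line stand-in for
plant.ano. The file demo.ano and its contents are invented for the
example. Each line of bad.str is treated as a separate pattern, and a
line matching any one of them is written with its line number.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# A made-up stand-in for a few lines of plant.ano.
cat > demo.ano <<'EOF'
LOCUS       DEMO1
ORGANISM    Eukaryota; Planta; Mycophyta; Ascomycotina.
LOCUS       DEMO2
DEFINITION  Pea chloroplast tRNA-Leu gene.
LOCUS       DEMO3
ORGANISM    Eukaryota; Planta; Phycophyta.
EOF

# One pattern per line; a line matching any of them is reported.
printf '%s\n' Mycophyta tRNA rRNA uRNA > bad.str

# -n prefixes each match with its line number in demo.ano.
grep -n -f bad.str demo.ano > bad.grep
cat bad.grep
```

Only lines 2 and 4 of demo.ano are reported. Note that the match is
case-sensitive here because -i is not given, and that an empty line in
the pattern file would match every line of the input.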
We next use identify to find the names of the entries found by
grep.
    identify Planta.grep plant.ind Planta.nam Planta.fnd
    identify bad.grep plant.ind bad.nam bad.fnd
Next, we can use the Unix comm command to compare the two .nam
files and produce an output file containing only names which are
present in Planta.nam but not bad.nam:
    comm -23 Planta.nam bad.nam > plants.nam
The file plants.nam now contains names of either plant cDNA or
genomic sequences which do not code for structural RNAs.
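
One practical point about the comparison step: comm expects both of its
input files to be sorted. Whether identify writes names in sorted order
is not stated here, so the sketch below sorts the lists first as a
precaution. The names and the intermediate .srt files are made up for
the illustration.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Use one collation order for both sort and comm.
LC_ALL=C
export LC_ALL

# Made-up name lists, deliberately not in sorted order.
printf '%s\n' PEACAB3 ATHCAB1 NCRTUB MZECAB2 > Planta.nam
printf '%s\n' NCRTUB ATHTRNL > bad.nam

# comm expects sorted input; sort -u also drops repeated names.
sort -u Planta.nam > Planta.srt
sort -u bad.nam > bad.srt

# -2 suppresses names found only in bad.srt, -3 suppresses names common
# to both files, leaving names that occur only in Planta.srt.
comm -23 Planta.srt bad.srt > plants.nam
cat plants.nam
```

Here NCRTUB appears in both lists and is dropped, so plants.nam holds
the three remaining names from the first list.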
At this point, getloc could be used to create a sub-database
containing only those entries listed in plants.nam. See the
documentation for getloc for a more detailed discussion.
SEE ALSO
grep, fgrep, egrep, ngrep, comm, splitgb, getloc
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.