IDENTIFY update 3 Feb 94


NAME
identify - creates a file of locus names corresponding to lines
found by grep in a GenBank annotation file.

SYNOPSIS
identify grepfile indfile namefile findfile

DESCRIPTION
grepfile is created using the Unix grep command to search a .ano
file created by splitgb. For example, to find all lines containing
the word 'chlorophyll' in plant.ano, use

grep -n -i 'chlorophyll' plant.ano > plant.grep

In the example shown, the -n option causes each line written to
plant.grep to be preceded by the number of that line in plant.ano.
(The -i option causes grep to ignore case.) Identify uses the
indfile to determine which entry a given numbered line was found
in, and writes the corresponding LOCUS name to namefile. In
addition, all lines found in a given entry are rewritten to
findfile without the line numbers, and preceded by the LOCUS name
for that entry.

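The lookup described above can be imitated on a toy file. This is only a
sketch of the idea, not the identify program itself. The file contents, the
demo.* file names, and the index layout ("name first-line") are all invented
for illustration, and the real indfile format may differ. Writing each name
once and using a single space between the LOCUS name and the line text are
likewise assumptions.

```shell
# Toy annotation file: two entries, each starting with a LOCUS line.
# (Contents are invented for illustration.)
cat > demo.ano <<'EOF'
LOCUS       AAA001
DEFINITION  chlorophyll a/b binding protein
KEYWORDS    Chlorophyll.
LOCUS       BBB002
DEFINITION  actin
EOF

# grep -n prefixes each matching line with its line number in demo.ano.
grep -n -i 'chlorophyll' demo.ano > demo.grep

# HYPOTHETICAL index: one "name first-line" pair per entry.
# The real indfile layout is not described here and may differ.
awk '$1 == "LOCUS" { print $2, NR }' demo.ano > demo.idx

# For each match, find the last entry that starts at or before the
# matching line, then write the name and the line without its number.
awk '
    NR == FNR { name[++n] = $1; start[n] = $2 + 0; next }  # load the index
    {
        colon  = index($0, ":")
        lineno = substr($0, 1, colon - 1) + 0   # number added by grep -n
        text   = substr($0, colon + 1)          # line as it appears in demo.ano
        i = n
        while (i > 1 && start[i] > lineno) i--
        if (name[i] != last) { print name[i] > "demo.nam"; last = name[i] }
        print (name[i] " " text) > "demo.fnd"
    }
' demo.idx demo.grep
```

On this toy input, demo.nam holds the single name AAA001, and demo.fnd holds
the two matching lines, each prefixed with AAA001 instead of a line number.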
EXAMPLES
Suppose you wanted to obtain a list of names for all plant
sequences which code for proteins. The task is complicated by the
fact that many fungal sequences are included in the GenBank plant
file. You could begin by searching plant.ano (containing all
GenBank plant entries) for the word 'Planta':

grep -n 'Planta' plant.ano > Planta.grep

However, we want to eliminate all fungal sequences, as well as all
sequences for RNAs other than mRNAs. If we create the file
bad.str containing the keywords

Mycophyta
tRNA
rRNA
uRNA

we can then type

grep -n -f bad.str plant.ano > bad.grep

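As a small check of how the keyword file behaves, the sketch below builds
bad.str and runs grep -f against a toy annotation file. The file demo2.ano
and its contents are invented for illustration. With -f, grep reads one
pattern per line and reports any line matching at least one of them. Each
pattern is treated as a regular expression and may match anywhere in a line.

```shell
# One keyword per line, as listed above.
printf '%s\n' Mycophyta tRNA rRNA uRNA > bad.str

# Toy annotation file (invented contents).
cat > demo2.ano <<'EOF'
LOCUS       CCC003
ORGANISM    Eukaryota; Planta; Mycophyta
LOCUS       DDD004
FEATURES    tRNA 1..72
LOCUS       EEE005
DEFINITION  rubisco small subunit mRNA
EOF

# Report every line that matches any pattern in bad.str,
# prefixed with its line number in demo2.ano.
grep -n -f bad.str demo2.ano > demo2.grep
cat demo2.grep
```

Here lines 2 and 4 are reported. The mRNA line is not, since none of the
keywords occurs in it.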
bad.grep now contains all lines containing the offending keywords.
We next use identify to find the names of the entries found by
grep.

identify Planta.grep plant.ind Planta.nam Planta.fnd
identify bad.grep plant.ind bad.nam bad.fnd

Next, we can use the Unix comm command to compare the two .nam
files and produce an output file containing only names which are
present in Planta.nam but not bad.nam:

comm -23 Planta.nam bad.nam > plants.nam

The file plants.nam now contains names of either plant cDNA or
genomic sequences which do not code for structural RNAs.
At this point, getloc could be used to create a sub-database
containing only those entries listed in plants.nam. See
documentation for getloc for a more detailed discussion.

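Note that comm expects both input files to be sorted in the same order.
Whether identify writes names in sorted order is not stated here, so the
sketch below sorts first as a precaution. The locus names and the .sorted
file names are invented for illustration. Run it in a scratch directory,
since it overwrites Planta.nam, bad.nam and plants.nam.

```shell
# Invented locus names standing in for the output of identify.
printf '%s\n' ZMACT1 ATCAB1 NCTUB2 ATCAB1 > Planta.nam
printf '%s\n' NCTUB2 > bad.nam

# comm needs sorted input; sort -u also drops duplicate names.
sort -u Planta.nam > Planta.sorted
sort -u bad.nam > bad.sorted

# -2 suppresses names found only in bad.sorted, -3 suppresses names
# found in both files, leaving names found only in Planta.sorted.
comm -23 Planta.sorted bad.sorted > plants.nam
cat plants.nam
```

With these toy names, plants.nam holds ATCAB1 and ZMACT1. NCTUB2 is dropped
because it also appears in bad.nam.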
SEE ALSO
grep, fgrep, egrep, ngrep, comm, splitgb, getloc

AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca

REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.