gde_linux/CORE/xylem/identify.doc


      IDENTIFY                                                update 3 Feb 94


      NAME
            identify - creates a file of locus names corresponding to lines
            found by grep in a GenBank annotation file.

      SYNOPSIS
            identify grepfile indfile namefile findfile

      DESCRIPTION
      grepfile is created using the Unix grep command to search a .ano
      file created by splitgb.  For example, to find all lines containing
      the word 'chlorophyll' in plant.ano, use

           grep -n -i 'chlorophyll' plant.ano > plant.grep

      In the example shown, the -n option causes each line written to
      plant.grep to be preceeded by the number of that line in plant.ano.
      (The -i option causes grep to ignore case.) Identify can use the
      indfile do determine which entry a given numbered line was found
      in, and writes the corresponding LOCUS name to namefile.  In
      addition, all lines found in a given entry are re-written to
      findfile without the line numbers, and preceeded by the LOCUS name
      for that entry.

      EXAMPLES
      Suppose you wanted to obtain a list of names for all plant
      sequences which code for proteins.  The task is complicated by the
      fact that many fungal sequences are included in the GenBank plant
      file.  You could begin by searching plant.ano (containing all
      GenBank plant entries) for the word 'Planta':

      grep -n 'Planta' plant.ano > Planta.grep

      However, we want to eliminate all fungal sequences, as well as all
      sequences for RNAs other than mRNAs.  If we create the file
      bad.str containing the keywords

      Mycophyta
      tRNA
      rRNA
      uRNA

      we can then type

      grep -n -f bad.str plant.ano > bad.grep

      bad.grep now contains all lines containing the offending keywords.
      We next use identify to find the names of the entries found by
      grep.

      identify Planta.grep plant.ind Planta.nam Planta.fnd
      identify bad.grep plant.ind bad.nam bad.fnd

      Next, we can use the Unix comm command to compare the two .nam
      files and produce an output file containing only names which are
      present in Planta.nam but not bad.nam:

      comm -23 Planta.nam bad.nam > plants.nam

      The file plants.nam now contains names of either plant cDNA or
      genomic sequences which do not code for structural RNAs.
      At this point, getloc could to create a sub-database containing
      only those entries listed in planta.nam.  See documentation for
      getloc for a more detailed discussion.

      SEE ALSO
            grep, fgrep, egrep, ngrep, comm, splitgb, getloc

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.