gde_linux/CORE/xylem/identify.doc

  
      IDENTIFY                                                update 3 Feb 94 
      
      
      NAME
            identify - creates a file of locus names corresponding to lines
            found by grep in a GenBank annotation file.
      
      SYNOPSIS
            identify grepfile indfile namefile findfile
      
      DESCRIPTION
      grepfile is created using the Unix grep command to search a .ano 
      file created by splitgb.  For example, to find all lines containing 
      the word 'chlorophyll' in plant.ano, use
      
           grep -n -i 'chlorophyll' plant.ano > plant.grep
      
      In the example shown, the -n option causes each line written to 
      plant.grep to be preceded by the number of that line in plant.ano. 
      (The -i option causes grep to ignore case.)  Identify uses the 
      indfile to determine which entry a given numbered line was found 
      in, and writes the corresponding LOCUS name to namefile.  In 
      addition, all lines found in a given entry are rewritten to 
      findfile without the line numbers, and preceded by the LOCUS name 
      for that entry. 
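      
      To make the line-number lookup concrete, here is a minimal sketch 
      of the idea.  It is not identify's actual implementation.  It 
      assumes a hypothetical two-column index (LOCUS name, then the 
      line number at which that entry starts in the annotation file); 
      the real layout of the indfile written by splitgb may differ.  
      The files toy.ano, toy.idx, toy.grep and toy.nam, and the data in 
      them, are invented for illustration.

```shell
#!/bin/sh
# Toy annotation file: two made-up entries, three lines each.
cat > toy.ano <<'EOF'
LOCUS       AAA001
DEFINITION  Chlorophyll a/b binding protein
//
LOCUS       BBB002
DEFINITION  actin
//
EOF

# Hypothetical index: LOCUS name and first line of the entry in toy.ano.
# This layout is an assumption, not the documented indfile format.
cat > toy.idx <<'EOF'
AAA001 1
BBB002 4
EOF

# grep -n prefixes every matching line with "<line number>:".
grep -n -i 'chlorophyll' toy.ano > toy.grep
cat toy.grep    # prints: 2:DEFINITION  Chlorophyll a/b binding protein

# For each hit, report the last entry whose start line is <= the hit line.
# sort -u keeps one copy of a name when an entry has several hits.
awk '
    NR == FNR { name[NR] = $1; start[NR] = $2 + 0; n = NR; next }
    {
        split($0, parts, ":")      # parts[1] is the line number from grep -n
        hit = parts[1] + 0
        for (i = n; i >= 1; i--)
            if (start[i] <= hit) { print name[i]; break }
    }
' toy.idx toy.grep | sort -u > toy.nam
cat toy.nam     # prints: AAA001
```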
      
      EXAMPLES
      Suppose you wanted to obtain a list of names for all plant 
      sequences which code for proteins.  The task is complicated by the 
      fact that many fungal sequences are included in the GenBank plant 
      file.  You could begin by searching plant.ano (containing all 
      GenBank plant entries) for the word 'Planta':
      
      grep -n 'Planta' plant.ano > Planta.grep
      
      However, we want to eliminate all fungal sequences, as well as all 
      sequences for RNAs other than mRNAs.  If we create the file 
      bad.str containing the keywords
      
      Mycophyta
      tRNA
      rRNA
      uRNA
      
      we can then type 
      
      grep -n -f bad.str plant.ano > bad.grep
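      
      As a small illustration of the pattern-file search, the sketch 
      below runs grep -f on toy data.  The files toy_bad.str and 
      toy2.ano are invented, and their lines are simplified stand-ins 
      rather than real GenBank records.  Each line of the file given 
      to -f is a separate pattern, and a line is reported if it 
      matches any of them.

```shell
#!/bin/sh
# Pattern file: one keyword per line.
cat > toy_bad.str <<'EOF'
Mycophyta
tRNA
rRNA
EOF

# Simplified, made-up annotation lines.
cat > toy2.ano <<'EOF'
LOCUS       CCC003
ORGANISM    Zea mays; Planta
FEATURES    tRNA-Leu gene
LOCUS       DDD004
ORGANISM    Neurospora crassa; Mycophyta
LOCUS       EEE005
ORGANISM    Pisum sativum; Planta
EOF

# Report every line (with its line number) that matches any pattern.
grep -n -f toy_bad.str toy2.ano > toy_bad.grep
cat toy_bad.grep
```

      Patterns are matched anywhere in a line, so tRNA also matches 
      any longer word that contains it.  grep -F (the POSIX equivalent 
      of fgrep) treats each pattern as a fixed string rather than a 
      regular expression.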
      
      bad.grep now contains all lines containing the offending keywords.  
      We next use identify to find the names of the entries found by 
      grep.
      
      identify Planta.grep plant.ind Planta.nam Planta.fnd
      identify bad.grep plant.ind bad.nam bad.fnd
      
      Next, we can use the Unix comm command to compare the two .nam 
      files and produce an output file containing only names which are 
      present in Planta.nam but not bad.nam:
      
      comm -23 Planta.nam bad.nam > plants.nam
      
      The file plants.nam now contains names of either plant cDNA or 
      genomic sequences which do not code for structural RNAs.
      At this point, getloc could be used to create a sub-database 
      containing only those entries listed in plants.nam.  See the 
      documentation for getloc for a more detailed discussion. 
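      
      The comm step can be tried on toy name lists.  comm expects both 
      inputs to be sorted in the same collating order; whether 
      identify writes namefile in sorted order is not stated here, so 
      the sketch below sorts defensively.  All file names and LOCUS 
      names in it are invented.

```shell
#!/bin/sh
# Invented name lists standing in for Planta.nam and bad.nam.
printf '%s\n' PEACAB MZEZEIN NEUCRA ATHRBCS > toy_planta.nam
printf '%s\n' NEUCRA MZETRNL > toy_bad.nam

# comm needs sorted input; sort -u also drops duplicate names.
# LC_ALL=C pins the collating order so sort and comm agree.
LC_ALL=C sort -u toy_planta.nam > toy_planta.srt
LC_ALL=C sort -u toy_bad.nam > toy_bad.srt

# -2 suppresses names found only in the second file, -3 suppresses names
# common to both, leaving names found only in the first file.
LC_ALL=C comm -23 toy_planta.srt toy_bad.srt > toy_plants.nam
cat toy_plants.nam
```

      With these lists, NEUCRA is dropped because it appears in both 
      files, and MZETRNL never appears because it is only in the 
      second file.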
           
      SEE ALSO
            grep, fgrep, egrep, ngrep, comm, splitgb, getloc
      
     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.