gde_linux/CORE/xylem/splitdb.doc


      SPLITDB                                                 update 28 Mar 98 
      
      
      NAME
            splitdb - split GenBank files into annotation, sequence, and index
      
      SYNOPSIS
            splitdb  [-gepvlct] dbfile anofile seqfile indfile
      
      DESCRIPTION
      Splitdb splits a database (dbfile) among three files: anofile, seqfile
      and indfile. Splitdb ignores any header information that might be in the
      file and begins processing at the first entry. 
      
      anofile contains the annotation portion of each entry. Entries are
      terminated with '//' or '///' (PIR only). Trailing blanks present in
      dbfile are omitted in anofile.
      
      seqfile contains the sequence data for each entry. Each sequence 
      entry begins with a header line, followed by sequence data on 
      succeeding lines of 75 characters per line.  The header line 
      includes the header flag character '>' in column 1, followed by the 
      name, followed by the first 50 characters of the 1st 
      DEFINITION line. An example is shown below:
      
      >UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
      attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
      ...
      
      Removal of blanks and numbers from sequence lines makes makes split
      datasets about 8-9% smaller than the original GenBank files.

      indfile is an index which tells the line numbers for each entry in 
      anofile and seqfile.  It is assumed to be in alphabetical order by 
      name. Each line contains a name and accession number, followed by the
      line numbers on which the annotation and sequence data begin in anofile 
      and seqfile, respectively.  Thus the file plants.ind might contain:


	A15660       TA156608        1        1
	A15671       A15671         33       11
	A15673       A15673         65       25
	A15675       AK156751       97       36
	A15677       BA156770      128       46
	A16780       BA167807      160       57
	A16782       A16782        192       70
	ATHRPRP1C    GM905105      225       83
        etc...
      
      Note that indfile is a perfectly legitimate .nam file, for use with 
      programs such as getloc, getob, or comm. 


      The following options identify the type of database being read:

            -g    GenBank (default)
            -e    EMBL
            -p    PIR (NBRF)
            -v    Vecbase
            -l    LiMB

      Other options:
            -c    Compress 3 or more leading blanks in annotation lines
                  to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG
                  is the ASCII character specified by the Pascal const 
                  CRUNCHOFFSET, which is set to 33 ("!") in the current 
                  implementation. For each annotation line read, if the
                  number of leading blanks is >=3, splitdb sets CRUNCHCHAR
                  to CRUNCHOFFSET+the number of blanks. Thus, for lines 
                  with 3, 4, or 5 leading blanks, CRUNCHCHAR would be
                  '$', '%' and '&', respectively. GETLOC and GETOB
                  automatically expand crunched blanks when CRUNCHFLAG
                  is encountered on an input line. Empiracle observations
                  indicate that the -c option decreases the size of 
                  GenBank files by about 10%.

                  This compression method may fail when the number of 
                  leading blanks exceeds 127-CRUNCHOFFSET. However, 
                  none of the above mentioned databases currently
                  supports any datafield with anywhere near that number
                  of leading blanks.
                  
             -t   (GenBank only) Append all information in the first
                  ORGANISM to the end of each line in indfile. For example,
                  the entry which begins:
                  
     LOCUS       GORMTDLOOZ    282 bp    DNA             UNA       11-MAR-1996
     DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
                 Zoological Gardens) mitochondrial D-loop DNA.
     ACCESSION   L76759
     NID         g1222584
     KEYWORDS    D-loop.
     SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,
            ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male
            DNA.
       ORGANISM  Mitochondrion Gorilla gorilla gorilla
                 Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
                 Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.
                  
                  might be indexed as
                  
      GORMTDLOOZ   L76759   1    1 Mitochondrion Gorilla gorilla gorilla
                 
                  This is useful for taxonomic studies, or as a way of making
                  it easy to create subsets from a single index. Thus,
                  'grep gorilla primates.ind' would print all lines in the
                  file that contained the word gorilla. The output from
                  this command could be used as a .nam file for extracting
                  just gorilla sequences from a larger dataset using 
                  fetch.
                 
                
     NOTES
       1. Header lines that aren't part of entries are automatically
       stripped out during processing. For example, in a file containing
       GenBank entries, all lines up to the first occurrence of 'LOCUS'
       starting in column 1, are ignored. Similarly for PIR, processing
       begins on the first line containing 'ENTRY' beginning in column 1.
       2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,
       have accession numbers of 8 characters, rather than 6. Previously
       assigned accession numbers will remain at 6 characters. Splitdb has
       been updated to write all accession numbers to the .ind file, left
       justified in a field of 8 characters, in columns 14-21 of the .ind 
       file.  
  
     SEE ALSO
            getloc, getob, comm(1) (Unix command).

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.
2006 version init 2022-03-08 04:43:05 +08:00
			`SPLITDB update 28 Mar 98`


			`NAME`
			`splitdb - split GenBank files into annotation, sequence, and index`

			`SYNOPSIS`
			`splitdb [-gepvlct] dbfile anofile seqfile indfile`

			`DESCRIPTION`
			`Splitdb splits a database (dbfile) among three files: anofile, seqfile`
			`and indfile. Splitdb ignores any header information that might be in the`
			`file and begins processing at the first entry.`

			`anofile contains the annotation portion of each entry. Entries are`
			`terminated with '//' or '///' (PIR only). Trailing blanks present in`
			`dbfile are omitted in anofile.`

			`seqfile contains the sequence data for each entry. Each sequence`
			`entry begins with a header line, followed by sequence data on`
			`succeeding lines of 75 characters per line. The header line`
			`includes the header flag character '>' in column 1, followed by the`
			`name, followed by the first 50 characters of the 1st`
			`DEFINITION line. An example is shown below:`

			`>UNHOR1 - Unicorn horn protein 1, complete cDNA sequence`
			`attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc`
			`...`

			`Removal of blanks and numbers from sequence lines makes makes split`
			`datasets about 8-9% smaller than the original GenBank files.`

			`indfile is an index which tells the line numbers for each entry in`
			`anofile and seqfile. It is assumed to be in alphabetical order by`
			`name. Each line contains a name and accession number, followed by the`
			`line numbers on which the annotation and sequence data begin in anofile`
			`and seqfile, respectively. Thus the file plants.ind might contain:`


			`A15660 TA156608 1 1`
			`A15671 A15671 33 11`
			`A15673 A15673 65 25`
			`A15675 AK156751 97 36`
			`A15677 BA156770 128 46`
			`A16780 BA167807 160 57`
			`A16782 A16782 192 70`
			`ATHRPRP1C GM905105 225 83`
			`etc...`

			`Note that indfile is a perfectly legitimate .nam file, for use with`
			`programs such as getloc, getob, or comm.`


			`The following options identify the type of database being read:`

			`-g GenBank (default)`
			`-e EMBL`
			`-p PIR (NBRF)`
			`-v Vecbase`
			`-l LiMB`

			`Other options:`
			`-c Compress 3 or more leading blanks in annotation lines`
			`to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG`
			`is the ASCII character specified by the Pascal const`
			`CRUNCHOFFSET, which is set to 33 ("!") in the current`
			`implementation. For each annotation line read, if the`
			`number of leading blanks is >=3, splitdb sets CRUNCHCHAR`
			`to CRUNCHOFFSET+the number of blanks. Thus, for lines`
			`with 3, 4, or 5 leading blanks, CRUNCHCHAR would be`
			`'$', '%' and '&', respectively. GETLOC and GETOB`
			`automatically expand crunched blanks when CRUNCHFLAG`
			`is encountered on an input line. Empiracle observations`
			`indicate that the -c option decreases the size of`
			`GenBank files by about 10%.`

			`This compression method may fail when the number of`
			`leading blanks exceeds 127-CRUNCHOFFSET. However,`
			`none of the above mentioned databases currently`
			`supports any datafield with anywhere near that number`
			`of leading blanks.`

			`-t (GenBank only) Append all information in the first`
			`ORGANISM to the end of each line in indfile. For example,`
			`the entry which begins:`

			`LOCUS GORMTDLOOZ 282 bp DNA UNA 11-MAR-1996`
			`DEFINITION GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon`
			`Zoological Gardens) mitochondrial D-loop DNA.`
			`ACCESSION L76759`
			`NID g1222584`
			`KEYWORDS D-loop.`
			`SOURCE Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,`
			`ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male`
			`DNA.`
			`ORGANISM Mitochondrion Gorilla gorilla gorilla`
			`Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;`
			`Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.`

			`might be indexed as`

			`GORMTDLOOZ L76759 1 1 Mitochondrion Gorilla gorilla gorilla`

			`This is useful for taxonomic studies, or as a way of making`
			`it easy to create subsets from a single index. Thus,`
			`'grep gorilla primates.ind' would print all lines in the`
			`file that contained the word gorilla. The output from`
			`this command could be used as a .nam file for extracting`
			`just gorilla sequences from a larger dataset using`
			`fetch.`


			`NOTES`
			`1. Header lines that aren't part of entries are automatically`
			`stripped out during processing. For example, in a file containing`
			`GenBank entries, all lines up to the first occurrence of 'LOCUS'`
			`starting in column 1, are ignored. Similarly for PIR, processing`
			`begins on the first line containing 'ENTRY' beginning in column 1.`
			`2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,`
			`have accession numbers of 8 characters, rather than 6. Previously`
			`assigned accession numbers will remain at 6 characters. Splitdb has`
			`been updated to write all accession numbers to the .ind file, left`
			`justified in a field of 8 characters, in columns 14-21 of the .ind`
			`file.`

			`SEE ALSO`
			`getloc, getob, comm(1) (Unix command).`

			`AUTHOR`
			`Dr. Brian Fristensky`
			`Dept. of Plant Science`
			`University of Manitoba`
			`Winnipeg, MB Canada R3T 2N2`
			`Phone: 204-474-6085`
			`FAX: 204-261-5732`
			`frist@cc.umanitoba.ca`

			`REFERENCE`
			`Fristensky, B. (1993) Feature expressions: creating and manipulating`
			`sequence datasets. Nucleic Acids Research 21:5997-6003.`