

      SPLITDB                                                 update 28 Mar 98


      NAME
            splitdb - split GenBank files into annotation, sequence, and index

      SYNOPSIS
            splitdb  [-gepvlct] dbfile anofile seqfile indfile

      DESCRIPTION
      Splitdb splits a database (dbfile) into three files: anofile, seqfile
      and indfile. Splitdb ignores any header information that might be in
      dbfile and begins processing at the first entry.

      anofile contains the annotation portion of each entry. Entries are
      terminated with '//' or '///' (PIR only). Trailing blanks present in
      dbfile are omitted in anofile.

      seqfile contains the sequence data for each entry. Each sequence
      entry begins with a header line, followed by sequence data on
      succeeding lines of 75 characters per line.  The header line
      consists of the header flag character '>' in column 1, followed by
      the name, followed by the first 50 characters of the first
      DEFINITION line. An example is shown below:

      >UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
      attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
      ...

      Removal of blanks and numbers from sequence lines makes split
      datasets about 8-9% smaller than the original GenBank files.
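
      As a rough illustration of the seqfile layout described above, the
      following Python sketch builds the lines for one entry. It is not
      taken from the splitdb source; the function name format_seq_entry
      and the ' - ' separator between name and definition (inferred from
      the example above) are assumptions.

```python
import re

LINE_WIDTH = 75   # sequence characters per line, as stated above
DEF_CHARS = 50    # characters kept from the first DEFINITION line


def format_seq_entry(name, definition, raw_sequence_lines):
    """Return the seqfile lines for one entry (hypothetical helper)."""
    # Assumption: name and definition are joined by " - ", as in the example.
    header = ">" + name + " - " + definition[:DEF_CHARS]
    # Drop blanks and position numbers from the database sequence lines.
    residues = re.sub(r"[\s\d]", "", "".join(raw_sequence_lines))
    # Re-wrap the bare sequence at LINE_WIDTH characters per line.
    body = [residues[i:i + LINE_WIDTH]
            for i in range(0, len(residues), LINE_WIDTH)]
    return [header] + body
```

      Given GenBank-style sequence lines such as
      '        1 attcctctat agtctattct', the helper returns the '>' header
      followed by the sequence with blanks and numbers removed.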

      indfile is an index giving the line numbers at which each entry
      begins in anofile and seqfile.  It is assumed to be in alphabetical
      order by name. Each line contains a name and accession number,
      followed by the line numbers on which the annotation and sequence
      data begin in anofile and seqfile, respectively.  Thus the file
      plants.ind might contain:


	A15660       TA156608        1        1
	A15671       A15671         33       11
	A15673       A15673         65       25
	A15675       AK156751       97       36
	A15677       BA156770      128       46
	A16780       BA167807      160       57
	A16782       A16782        192       70
	ATHRPRP1C    GM905105      225       83
        etc...

      Note that indfile is a valid .nam file, for use with
      programs such as getloc, getob, or comm.
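
      The index layout described above can be read with a few lines of
      Python. This is only a sketch: the names IndexEntry and
      parse_index_line are hypothetical, and fields are assumed to be
      separated by whitespace. Any text after the fourth field (such as
      the organism text appended by the -t option) is kept in 'extra'.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    name: str
    accession: str
    ano_line: int   # line in anofile where the annotation begins
    seq_line: int   # line in seqfile where the sequence entry begins
    extra: str      # trailing text, if any (empty string otherwise)


def parse_index_line(line):
    """Parse one indfile line into an IndexEntry (hypothetical helper)."""
    # Split into at most five pieces so trailing text stays in one field.
    fields = line.split(None, 4)
    if len(fields) < 4:
        raise ValueError("not an index line: %r" % line)
    extra = fields[4].strip() if len(fields) > 4 else ""
    return IndexEntry(fields[0], fields[1],
                      int(fields[2]), int(fields[3]), extra)
```

      For the second line of the plants.ind example, ano_line would be 33
      and seq_line would be 11.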


      The following options identify the type of database being read:

            -g    GenBank (default)
            -e    EMBL
            -p    PIR (NBRF)
            -v    Vecbase
            -l    LiMB

      Other options:
            -c    Compress 3 or more leading blanks in annotation lines
                  to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG
                  is the ASCII character whose code is given by the Pascal
                  const CRUNCHOFFSET, which is set to 33 ("!") in the current
                  implementation. For each annotation line read, if the
                  number of leading blanks is >=3, splitdb sets CRUNCHCHAR
                  to the character whose code is CRUNCHOFFSET + the number
                  of blanks. Thus, for lines with 3, 4, or 5 leading blanks,
                  CRUNCHCHAR would be '$', '%' and '&', respectively. GETLOC
                  and GETOB automatically expand crunched blanks when
                  CRUNCHFLAG is encountered on an input line. Empirical
                  observations indicate that the -c option decreases the
                  size of GenBank files by about 10%.

                  This compression method may fail when the number of
                  leading blanks exceeds 127-CRUNCHOFFSET (94 in the
                  current implementation). However, none of the
                  above-mentioned databases currently has any data field
                  with anywhere near that number of leading blanks.

             -t   (GenBank only) Append the contents of the first
                  ORGANISM line to the end of each line in indfile. For
                  example, the entry which begins:

     LOCUS       GORMTDLOOZ    282 bp    DNA             UNA       11-MAR-1996
     DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
                 Zoological Gardens) mitochondrial D-loop DNA.
     ACCESSION   L76759
     NID         g1222584
     KEYWORDS    D-loop.
     SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,
                 ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male
                 DNA.
       ORGANISM  Mitochondrion Gorilla gorilla gorilla
                 Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
                 Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.

                  might be indexed as

      GORMTDLOOZ   L76759   1    1 Mitochondrion Gorilla gorilla gorilla

                  This is useful for taxonomic studies, or as a way of making
                  it easy to create subsets from a single index. Thus,
                  'grep gorilla primates.ind' would print all lines in the
                  file that contain the word gorilla. The output from
                  this command could be used as a .nam file for extracting
                  just gorilla sequences from a larger dataset using
                  fetch.
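
      The leading-blank compression performed by the -c option, described
      above, can be sketched in Python as follows. This is an
      illustration only, not the splitdb or GETLOC code: the function
      names crunch and expand are hypothetical, the sketch does not guard
      against more than 127-CRUNCHOFFSET leading blanks, and it assumes
      that no uncompressed line naturally begins with the flag character.

```python
CRUNCHOFFSET = 33                # value given above; chr(33) is "!"
CRUNCHFLAG = chr(CRUNCHOFFSET)   # flag character marking a crunched line


def crunch(line):
    """Replace 3 or more leading blanks with CRUNCHFLAG + CRUNCHCHAR."""
    stripped = line.lstrip(" ")
    blanks = len(line) - len(stripped)
    if blanks < 3:
        return line              # too few blanks to be worth encoding
    crunchchar = chr(CRUNCHOFFSET + blanks)
    return CRUNCHFLAG + crunchchar + stripped


def expand(line):
    """Restore the leading blanks of a line produced by crunch()."""
    if len(line) >= 2 and line[0] == CRUNCHFLAG:
        blanks = ord(line[1]) - CRUNCHOFFSET
        return " " * blanks + line[2:]
    return line
```

      A line with 12 leading blanks, as in GenBank continuation lines,
      shrinks by 10 characters, since two characters replace the twelve
      blanks.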


     NOTES
       1. Header lines that aren't part of entries are automatically
       stripped out during processing. For example, in a file containing
       GenBank entries, all lines up to the first occurrence of 'LOCUS'
       starting in column 1 are ignored. Similarly, for PIR, processing
       begins on the first line containing 'ENTRY' beginning in column 1.
       2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,
       have accession numbers of 8 characters, rather than 6. Previously
       assigned accession numbers will remain at 6 characters. Splitdb has
       been updated to write all accession numbers to the .ind file
       left-justified in a field of 8 characters, in columns 14-21.
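
       The header stripping described in note 1 amounts to discarding
       lines until the first entry keyword appears in column 1. A minimal
       Python sketch follows; the function name skip_header is
       hypothetical, and only the 'LOCUS' and 'ENTRY' keywords named
       above are taken from this document.

```python
def skip_header(lines, first_keyword="LOCUS"):
    """Yield lines from the first one starting with first_keyword onward.

    A keyword counts only when it begins in column 1, so indented or
    mid-line occurrences in the header are still skipped.
    """
    started = False
    for line in lines:
        if not started and line.startswith(first_keyword):
            started = True
        if started:
            yield line
```

       For a PIR file, the same function would be called with
       first_keyword="ENTRY".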

     SEE ALSO
            getloc, getob, comm(1) (Unix command).

     AUTHOR
       Dr. Brian Fristensky
       Dept. of Plant Science
       University of Manitoba
       Winnipeg, MB  Canada  R3T 2N2
       Phone: 204-474-6085
       FAX: 204-261-5732
       frist@cc.umanitoba.ca

     REFERENCE
       Fristensky, B. (1993) Feature expressions: creating and manipulating
       sequence datasets. Nucleic Acids Research 21:5997-6003.