SPLITDB update 28 Mar 98
NAME
splitdb - split GenBank files into annotation, sequence, and index
SYNOPSIS
splitdb [-gepvlct] dbfile anofile seqfile indfile
DESCRIPTION
Splitdb splits a database (dbfile) into three files: anofile, seqfile
and indfile. Splitdb ignores any header information that might be in
dbfile and begins processing at the first entry.
anofile contains the annotation portion of each entry. Entries are
terminated with '//' or '///' (PIR only). Trailing blanks present in
dbfile are omitted in anofile.
seqfile contains the sequence data for each entry. Each sequence
entry begins with a header line, followed by the sequence data on
succeeding lines of 75 characters per line. The header line
consists of the header flag character '>' in column 1, followed by
the name, followed by the first 50 characters of the first
DEFINITION line. An example is shown below:
>UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
...
Removal of blanks and numbers from sequence lines makes split
datasets about 8-9% smaller than the original GenBank files.
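The seqfile layout described above can be sketched in Python. This is
only an illustration, not the Pascal code used by splitdb: the function
name format_seq_entry and its arguments are hypothetical, and the " - "
separator between name and definition is copied from the example above.

```python
def format_seq_entry(name, definition, raw_seq_lines, width=75):
    """Return the seqfile lines for one entry (hypothetical helper)."""
    # Header: '>' in column 1, the name, then the first 50 characters
    # of the first DEFINITION line. The " - " separator is an assumption
    # taken from the example in this document.
    header = ">" + name + " - " + definition[:50]
    # Remove blanks and position numbers from the raw sequence lines.
    seq = "".join(
        ch
        for line in raw_seq_lines
        for ch in line
        if not (ch.isspace() or ch.isdigit())
    )
    # Rewrap the cleaned sequence at `width` characters per line.
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return [header] + body
```

Calling format_seq_entry with the name, the first DEFINITION line, and
the numbered sequence lines of a GenBank entry returns the header line
followed by the rewrapped sequence lines.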
indfile is an index which gives the starting line numbers for each
entry in anofile and seqfile. It is assumed to be in alphabetical order
by name. Each line contains a name and accession number, followed by the
line numbers on which the annotation and sequence data begin in anofile
and seqfile, respectively. Thus the file plants.ind might contain:
A15660       TA156608     1     1
A15671       A15671      33    11
A15673       A15673      65    25
A15675       AK156751    97    36
A15677       BA156770   128    46
A16780       BA167807   160    57
A16782       A16782     192    70
ATHRPRP1C    GM905105   225    83
etc...
Note that indfile is a perfectly legitimate .nam file, for use with
programs such as getloc, getob, or comm.
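As a rough sketch of how a program might read indfile, the Python below
splits each line on whitespace and keeps the first four fields. The
names IndexEntry, parse_index_line and load_index are hypothetical and
are not part of splitdb or the other programs mentioned here.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    name: str
    accession: str
    ano_line: int  # line on which the entry begins in anofile
    seq_line: int  # line on which the entry begins in seqfile


def parse_index_line(line):
    """Parse one indfile line; fields after the fourth are ignored."""
    fields = line.split()
    return IndexEntry(fields[0], fields[1], int(fields[2]), int(fields[3]))


def load_index(lines):
    """Build a name -> IndexEntry lookup, skipping blank lines."""
    entries = (parse_index_line(l) for l in lines if l.strip())
    return {e.name: e for e in entries}
```

With such a lookup, the annotation for an entry can be found by reading
anofile from line ano_line up to the terminating '//' line.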
The following options identify the type of database being read:
-g GenBank (default)
-e EMBL
-p PIR (NBRF)
-v Vecbase
-l LiMB
Other options:
-c Compress 3 or more leading blanks in annotation lines
to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG
is the ASCII character specified by the Pascal const
CRUNCHOFFSET, which is set to 33 ("!") in the current
implementation. For each annotation line read, if the
number of leading blanks is >=3, splitdb sets CRUNCHCHAR
to CRUNCHOFFSET+the number of blanks. Thus, for lines
with 3, 4, or 5 leading blanks, CRUNCHCHAR would be
'$', '%' and '&', respectively. GETLOC and GETOB
automatically expand crunched blanks when CRUNCHFLAG
is encountered on an input line. Empirical observations
indicate that the -c option decreases the size of
GenBank files by about 10%.
This compression method may fail when the number of
leading blanks exceeds 127-CRUNCHOFFSET. However,
none of the above-mentioned databases currently
supports any data field with anywhere near that number
of leading blanks.
-t (GenBank only) Append all information in the first
ORGANISM line to the end of each line in indfile. For example,
the entry which begins:
LOCUS       GORMTDLOOZ    282 bp    DNA             UNA       11-MAR-1996
DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
            Zoological Gardens) mitochondrial D-loop DNA.
ACCESSION   L76759
NID         g1222584
KEYWORDS    D-loop.
SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,
            ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male
            DNA.
  ORGANISM  Mitochondrion Gorilla gorilla gorilla
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.
might be indexed as
GORMTDLOOZ L76759 1 1 Mitochondrion Gorilla gorilla gorilla
This is useful for taxonomic studies, or as a way of making
it easy to create subsets from a single index. Thus,
'grep gorilla primates.ind' would print all lines in the
file that contained the word gorilla. The output from
this command could be used as a .nam file for extracting
just gorilla sequences from a larger dataset using
fetch.
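The leading-blank compression performed by -c can be sketched in Python
as follows. This is an illustration under stated assumptions, not the
Pascal source: the function names crunch and expand are hypothetical,
lines with more than 127-CRUNCHOFFSET leading blanks are simply left
unchanged here, and expand assumes that no uncrunched annotation line
begins with CRUNCHFLAG.

```python
CRUNCHOFFSET = 33               # value given in this document
CRUNCHFLAG = chr(CRUNCHOFFSET)  # "!"


def crunch(line):
    """Replace 3 or more leading blanks with CRUNCHFLAG + CRUNCHCHAR."""
    stripped = line.lstrip(" ")
    nblanks = len(line) - len(stripped)
    # Assumption: lines outside the representable range are not crunched.
    if nblanks < 3 or nblanks > 127 - CRUNCHOFFSET:
        return line
    return CRUNCHFLAG + chr(CRUNCHOFFSET + nblanks) + stripped


def expand(line):
    """Restore the leading blanks of a crunched line."""
    # Assumption: a line starting with CRUNCHFLAG is always a crunched line.
    if len(line) >= 2 and line[0] == CRUNCHFLAG:
        nblanks = ord(line[1]) - CRUNCHOFFSET
        return " " * nblanks + line[2:]
    return line
```

For lines with 3, 4, or 5 leading blanks, crunch produces the prefixes
'!$', '!%' and '!&', matching the CRUNCHCHAR values listed under -c.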
NOTES
1. Header lines that aren't part of entries are automatically
stripped out during processing. For example, in a file containing
GenBank entries, all lines up to the first occurrence of 'LOCUS'
starting in column 1 are ignored. Similarly, for PIR, processing
begins on the first line containing 'ENTRY' beginning in column 1.
2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,
have accession numbers of 8 characters, rather than 6. Previously
assigned accession numbers will remain at 6 characters. Splitdb has
been updated to write all accession numbers to the .ind file, left
justified in a field of 8 characters, in columns 14-21 of the .ind
file.
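The two notes above can be illustrated with a short Python sketch. The
function names skip_header and accession_field are hypothetical. Only
the accession columns (14-21) are taken from this document; the layout
of the other .ind fields is not assumed.

```python
def skip_header(lines, keyword="LOCUS"):
    """Yield lines starting at the first one with keyword in column 1.

    Use keyword="ENTRY" for PIR files.
    """
    started = False
    for line in lines:
        if not started and line.startswith(keyword):
            started = True
        if started:
            yield line


def accession_field(ind_line):
    """Return the accession number from columns 14-21 of an .ind line."""
    # Columns are 1-based in the text, so 14-21 is the slice [13:21].
    # The field is left justified, so only trailing blanks are removed.
    return ind_line[13:21].rstrip()
```

A line such as '  LOCUS' does not start processing in this sketch,
because the keyword must begin in column 1.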
SEE ALSO
getloc, getob, comm(1) (Unix command).
AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca
REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.