SPLITDB update 28 Mar 98

NAME
splitdb - split GenBank files into annotation, sequence, and index

SYNOPSIS
splitdb [-gepvlct] dbfile anofile seqfile indfile

DESCRIPTION
Splitdb splits a database (dbfile) among three files: anofile, seqfile
and indfile. Splitdb ignores any header information that might be in the
file and begins processing at the first entry.

anofile contains the annotation portion of each entry. Entries are
terminated with '//' or '///' (PIR only). Trailing blanks present in
dbfile are omitted in anofile.

seqfile contains the sequence data for each entry. Each sequence
entry begins with a header line, followed by sequence data on
succeeding lines of 75 characters per line. The header line
includes the header flag character '>' in column 1, followed by the
name and the first 50 characters of the first DEFINITION line
(separated by ' - ' in the example below):

>UNHOR1 - Unicorn horn protein 1, complete cDNA sequence
attcctctatagtctattctagctagccaaataggttagatggctgtcttactacttacgc
...

Removal of blanks and numbers from sequence lines makes split
datasets about 8-9% smaller than the original GenBank files.

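The seqfile layout described above can be sketched in a few lines of Python. This is only an illustration, not splitdb's own (Pascal) code: the function name format_seq_entry and its parameters are hypothetical, and the ' - ' separator is inferred from the example header rather than stated in the text.

```python
def format_seq_entry(name, definition, sequence, width=75, def_chars=50):
    """Build one seqfile-style entry as a list of lines (illustrative sketch)."""
    # Header: '>' in column 1, then the name, then the start of the DEFINITION.
    # The " - " separator is an assumption taken from the example header.
    header = ">" + name + " - " + definition[:def_chars]
    # Drop blanks and digits, as the man page says splitdb does for sequence lines.
    residues = "".join(
        ch for ch in sequence if not ch.isspace() and not ch.isdigit()
    )
    # Wrap the cleaned sequence at `width` characters per line.
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    return [header] + body


entry = format_seq_entry(
    "UNHOR1",
    "Unicorn horn protein 1, complete cDNA sequence",
    "        1 attcctctat agtctattct agctagccaa ataggttaga",
)
print("\n".join(entry))
```

The input sequence line imitates a GenBank ORIGIN line (position number plus blocks of ten bases); stripping the number and blanks is what accounts for the size reduction mentioned above.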
indfile is an index giving the line numbers of each entry in
anofile and seqfile. It is assumed to be in alphabetical order by
name. Each line contains a name and accession number, followed by the
line numbers on which the annotation and sequence data begin in anofile
and seqfile, respectively. Thus the file plants.ind might contain:

A15660       TA156608      1      1
A15671       A15671       33     11
A15673       A15673       65     25
A15675       AK156751     97     36
A15677       BA156770    128     46
A16780       BA167807    160     57
A16782       A16782      192     70
ATHRPRP1C    GM905105    225     83
etc...

Note that indfile is a perfectly legitimate .nam file, for use with
programs such as getloc, getob, or comm.

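As a rough illustration of how the index might be consumed, the sketch below parses lines like those in plants.ind into records. The names IndexEntry and parse_index_line are hypothetical, and the code assumes the four fields are separated by whitespace; it does not rely on fixed column positions.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    name: str
    accession: str
    ano_line: int   # line in anofile where the annotation begins
    seq_line: int   # line in seqfile where the sequence entry begins


def parse_index_line(line):
    """Parse one indfile line; any text after the fourth field is ignored."""
    fields = line.split()
    return IndexEntry(fields[0], fields[1], int(fields[2]), int(fields[3]))


sample = """\
A15660 TA156608 1 1
A15671 A15671 33 11
A15673 A15673 65 25
ATHRPRP1C GM905105 225 83
"""

entries = [parse_index_line(l) for l in sample.splitlines() if l.strip()]
by_name = {e.name: e for e in entries}
print(by_name["A15673"])
```

Looking an entry up by name then gives the line at which to start reading in each of the other two files.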
The following options identify the type of database being read:

-g  GenBank (default)
-e  EMBL
-p  PIR (NBRF)
-v  Vecbase
-l  LiMB

Other options:

-c  Compress 3 or more leading blanks in annotation lines
    to take the form <CRUNCHFLAG><CRUNCHCHAR>, where CRUNCHFLAG
    is the ASCII character specified by the Pascal const
    CRUNCHOFFSET, which is set to 33 ("!") in the current
    implementation. For each annotation line read, if the
    number of leading blanks is >=3, splitdb sets CRUNCHCHAR
    to the character whose ASCII code is CRUNCHOFFSET + the
    number of blanks. Thus, for lines with 3, 4, or 5 leading
    blanks, CRUNCHCHAR would be '$', '%' and '&', respectively.
    GETLOC and GETOB automatically expand crunched blanks when
    CRUNCHFLAG is encountered on an input line. Empirical
    observations indicate that the -c option decreases the
    size of GenBank files by about 10%.

    This compression method may fail when the number of
    leading blanks exceeds 127-CRUNCHOFFSET (94 in the current
    implementation). However, none of the above-mentioned
    databases currently supports any data field with anywhere
    near that number of leading blanks.

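The -c encoding is simple enough to sketch. The Python below is an illustration of the scheme as described, not the Pascal source: the function names crunch and expand are hypothetical, leaving over-long runs of blanks untouched is an assumption (the text only says the method may fail there), and expand assumes that no uncrunched line begins with "!".

```python
CRUNCHOFFSET = 33                # value given in the man page
CRUNCHFLAG = chr(CRUNCHOFFSET)   # "!"
MAX_BLANKS = 127 - CRUNCHOFFSET  # 94; beyond this the scheme may fail


def crunch(line):
    """Replace 3 or more leading blanks with CRUNCHFLAG + CRUNCHCHAR."""
    n = len(line) - len(line.lstrip(" "))
    # Assumption: lines with too few or too many blanks are left unchanged.
    if n < 3 or n > MAX_BLANKS:
        return line
    return CRUNCHFLAG + chr(CRUNCHOFFSET + n) + line[n:]


def expand(line):
    """Undo crunch(); assumes uncrunched lines never start with CRUNCHFLAG."""
    if len(line) >= 2 and line[0] == CRUNCHFLAG:
        n = ord(line[1]) - CRUNCHOFFSET
        return " " * n + line[2:]
    return line


original = " " * 12 + "Zoological Gardens) mitochondrial D-loop DNA."
packed = crunch(original)
print(len(original), len(packed))
print(expand(packed) == original)
```

Each crunched line shrinks by the number of leading blanks minus two, which is why heavily indented continuation lines benefit most.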
-t  (GenBank only) Append all information in the first
    ORGANISM line to the end of each line in indfile. For
    example, the entry which begins:

    LOCUS       GORMTDLOOZ  282 bp  DNA  UNA  11-MAR-1996
    DEFINITION  GGGOMT493; Gorilla gorilla gorilla (BomBom, ISIS 438, Audubon
                Zoological Gardens) mitochondrial D-loop DNA.
    ACCESSION   L76759
    NID         g1222584
    KEYWORDS    D-loop.
    SOURCE      Mitochondrion Gorilla gorilla gorilla (individual_isolate BomBom,
                ISIS 438, Audubon Zoological Gardens, sub_species gorilla) male
                DNA.
      ORGANISM  Mitochondrion Gorilla gorilla gorilla
                Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
                Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Gorilla.

    might be indexed as

    GORMTDLOOZ   L76759        1      1  Mitochondrion Gorilla gorilla gorilla

    This is useful for taxonomic studies, or as a way of making
    it easy to create subsets from a single index. Thus,
    'grep gorilla primates.ind' would print all lines in the
    file that contained the word gorilla. The output from
    this command could be used as a .nam file for extracting
    just gorilla sequences from a larger dataset using
    fetch.

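The grep-based subsetting can also be done programmatically. The sketch below is a hypothetical helper (select_by_organism is not part of the package) that keeps index lines whose appended organism text contains a given word. Unlike grep, which matches anywhere on the line, it looks only at the text after the fourth field. The second sample line is invented purely for illustration.

```python
def select_by_organism(index_lines, word):
    """Return index lines whose organism text contains `word` (case-sensitive)."""
    selected = []
    for line in index_lines:
        # Fields: name, accession, anofile line, seqfile line, organism text.
        fields = line.split(None, 4)
        organism = fields[4] if len(fields) > 4 else ""
        if word in organism:
            selected.append(line)
    return selected


index_lines = [
    "GORMTDLOOZ L76759 1 1 Mitochondrion Gorilla gorilla gorilla",
    "HYPOTHET1 X00000 30 12 Homo sapiens",   # made-up entry, not real data
]

for line in select_by_organism(index_lines, "gorilla"):
    print(line)
```

Restricting the match to the organism text avoids accidental hits on entry names, at the cost of behaving slightly differently from the grep command shown above.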
NOTES
1. Header lines that aren't part of entries are automatically
   stripped out during processing. For example, in a file containing
   GenBank entries, all lines up to the first occurrence of 'LOCUS'
   starting in column 1 are ignored. Similarly for PIR, processing
   begins on the first line containing 'ENTRY' beginning in column 1.
2. GenBank/EMBL/DDBJ entries created on or after Feb. 1, 1996,
   have accession numbers of 8 characters, rather than 6. Previously
   assigned accession numbers will remain at 6 characters. Splitdb has
   been updated to write all accession numbers to the .ind file, left
   justified in a field of 8 characters, in columns 14-21 of the .ind
   file.

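Both notes can be illustrated briefly. The Python below is a sketch under stated assumptions, not splitdb's implementation: skip_header and accession_field are hypothetical names, only the two keywords named in note 1 are covered, and the sample header text is invented.

```python
import itertools

# Entry-start keywords taken from note 1; other database types are not covered.
ENTRY_START = {"genbank": "LOCUS", "pir": "ENTRY"}


def skip_header(lines, dbtype="genbank"):
    """Drop every line before the first one with the keyword in column 1."""
    keyword = ENTRY_START[dbtype]
    return list(
        itertools.dropwhile(lambda l: not l.startswith(keyword), lines)
    )


def accession_field(accession):
    """Left-justify an accession number in a field of 8 characters (note 2)."""
    return accession.ljust(8)


dbfile_lines = [
    "Some release banner",          # made-up header text
    "",
    "LOCUS       GORMTDLOOZ",
    "DEFINITION  GGGOMT493; Gorilla gorilla gorilla",
]

print(skip_header(dbfile_lines))
print(repr(accession_field("L76759")))
```

Padding 6-character accession numbers to 8 keeps the fields that follow at the same columns for old and new entries alike.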
SEE ALSO
getloc, getob, comm(1) (Unix command).

AUTHOR
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
Phone: 204-474-6085
FAX: 204-261-5732
frist@cc.umanitoba.ca

REFERENCE
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.