Notes on library handling
-------------------------

Contents of this document:

I)   Introduction
II)  Details of file organisation and use
III) Options currently available
IV)  Installation guide
V)   New feature table handling facilities
VI)  Indexing the sequence libraries


Section I Introduction
----------------------

Available sequence libraries

There are a number of different nucleotide and protein sequence libraries:
PIR, GenBank, EMBL, Swissprot, and the Japanese Databank. Even after all the
years of their existence they still use different formats for their data. This
provides tedious and unrewarding work for software developers. Recently EMBL
and GenBank agreed a new and common way of writing their feature tables, which
is a great help, although the rest of their format is different. Swissprot
still uses the old EMBL-style feature table format and PIR yet another.

All the libraries distribute their data on magnetic tapes, and EMBL and GenBank
have started to distribute on cdrom. The EMBL cdrom also contains Swissprot.
The GenBank and EMBL cdroms use different formats and have different contents.
The EMBL cdrom has useful indexes sorted alphabetically: those for entry name
and accession number, brief descriptions, keywords and freetext indexes are
already available and others are expected. These indexes point to the data for
each entry, and can be used to extract the data for any entry quickly.

Moving to unix

The VAX version of our package used PIR format, which meant reformatting all
libraries other than PIR into that format. This required, at least
temporarily, having space for two copies of the libraries, and quite a lot of
cpu time. The software for doing this was provided by PIR, and is very VAX
specific and hence will not run under unix. For the unix version of our package
I have decided to use the EMBL cdrom format and its indexes as the primary
format. The current programs also support the use of PIR format libraries
without indexes, i.e. just the sequence and annotation files.

Indexing GenBank, EMBL updates, PIR and NRL3D

We include programs to create indexes for the above libraries. See below and
the README file in indexseqlibs. The programs can read all the above libraries
once the indexes are created. The indexing programs index the data in its
distributed form: WE DO NOT REFORMAT OR COPY THE LIBRARIES but simply create
indexes to the original files. Obviously this saves a lot of disk space, and
for those content to use only EMBL and Swissprot from the cdrom, almost no disk
space is required. We haven't tried it yet, but for GenBank on cdrom, the only
extra disk space required would be for the indexes.

---------------------------------------------------------------------------

Section II Details of file organisation and use
-----------------------------------------------

The following strategy has been used to try to deal with alternative
and changing sequence library formats.

1) libraries are described at several levels:

   a) the top level file is a list of available libraries which contains:
      the library type, the name of the file containing the names of
      each library's individual files, and the prompt to appear on
      the user's screen: LTYPE LOGNAM PROMPT

   b) the file containing the names of the library's individual files
      contains flags to define the file types: FTYPE LOGNAM

   c) the individual library files



2) library types handled:

   a) EMBL/SWISSPROT in distributed format with cdrom index format
      LTYPE = 'A'
   b) GenBank in distributed format with cdrom index format LTYPE = 'C'
   c) PIR/NRL3D in CODATA format with cdrom index format LTYPE = 'B'
   d) PIR/NBRF .seq files can be read sequentially as "personal files
      in PIR format" and do not appear in the list of available libraries.
   e) FASTA format files can be read sequentially as "personal files
      in FASTA format" and do not appear in the list of available
      libraries.

3) EMBL, SWISSPROT and other libraries for which EMBL-style indexes have been
   created

   current file types:

   A division.lookup
   B entryname.index
   C accession.target
   D accession.hits
   E brief description
   F freetext.target
   G freetext.hits
   H author.target
   I author.hits


                                Library list
level 1
                                     |
                                     |
        -----------------------------------------------------------
        |                            |                            |
 lib 1 file list              lib 2 file list              lib 3 file list
level 2
        |                            |
    --------                     ---------
level 3
     file 1                       file 1
     file 2                       file 2
        .                            .
     file n                       file n

---------------------------------------------------------------------------


Example
-------

Level 1

File name: sequence.libs
Environment variable: SEQUENCELIBRARIES
Contents:

A EMBLFILES  EMBL nucleotide library ! in cdrom format
C GENBFILES  GenBank nucleotide library!
A SWISSFILES SWISSPROT protein library! in cdrom format
B PIRFILES   PIR protein library!
B NRL3DFILES NRL3D protein library!

Notes:

The libraries have types A, B and C. The logical names are EMBLFILES,
SWISSFILES, etc., and the prompts are 'EMBL nucleotide library',
'SWISSPROT protein library', etc. Anything to the right of a ! is a comment.
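
The layout of the top level file (LTYPE LOGNAM PROMPT, with ! starting a
comment) can be read with a few lines of code. The following is only a sketch
in Python, not the code used by the programs; the names LibraryEntry and
parse_library_list are invented for this illustration, and it assumes that
the fields are separated by white space.

```python
from typing import NamedTuple


class LibraryEntry(NamedTuple):
    ltype: str    # library type flag: 'A', 'B' or 'C'
    logname: str  # logical name of the level 2 file list
    prompt: str   # text shown on the user's screen


def parse_library_list(text: str) -> list[LibraryEntry]:
    """Parse the contents of a top level library list (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        # Anything to the right of a '!' is a comment.
        line = line.split("!", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 2)  # the prompt may itself contain spaces
        if len(parts) < 3:
            raise ValueError(f"expected LTYPE LOGNAM PROMPT, got {line!r}")
        entries.append(LibraryEntry(*parts))
    return entries


SAMPLE = """\
A EMBLFILES  EMBL nucleotide library ! in cdrom format
C GENBFILES  GenBank nucleotide library!
B PIRFILES   PIR protein library!
"""

for entry in parse_library_list(SAMPLE):
    print(entry.ltype, entry.logname, entry.prompt)
```

Splitting at most twice keeps the whole prompt as one field, so prompts with
several words need no quoting.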

Level 2: the list of library files (using EMBL as an example)

File name: embl.files
Environment variable: EMBLFILES
Contents:

A EMBLDIVPATH/embl_div.lkp
B EMBLINDPATH/entrynam.idx
C EMBLINDPATH/acnum.trg
D EMBLINDPATH/acnum.hit
E EMBLINDPATH/brief.idx
F EMBLINDPATH/freetext.trg
G EMBLINDPATH/freetext.hit
H EMBLINDPATH/author.trg
I EMBLINDPATH/author.hit
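
As an illustration of how a level 2 file might be interpreted, the sketch
below (Python) maps each FTYPE flag to a file name. It ASSUMES that the first
component of each path (EMBLDIVPATH, EMBLINDPATH) is the name of an
environment variable holding a directory; that is suggested by the notes but
not spelled out, and the function name parse_file_list is hypothetical.

```python
import os


def parse_file_list(text, env=None):
    """Return {FTYPE: path} for the contents of a level 2 file list.

    Assumption: a leading path component that names an environment
    variable is replaced by the value of that variable.
    """
    env = os.environ if env is None else env
    files = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ftype, path = line.split(None, 1)
        head, sep, tail = path.partition("/")
        if sep and head in env:
            path = env[head].rstrip("/") + "/" + tail
        files[ftype] = path
    return files


SAMPLE = """\
A EMBLDIVPATH/embl_div.lkp
B EMBLINDPATH/entrynam.idx
"""

# The directory below is made up for the example.
print(parse_file_list(SAMPLE, {"EMBLINDPATH": "/cdrom/indices"}))
```

Paths whose first component is not a known variable are left untouched, so
the same routine would also accept ordinary relative paths.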


Level 3: the sequence and annotation files (e.g. 15 for EMBL, 1 for Swissprot).

Paths and file names:

EMBLPATH/bb.dat
EMBLPATH/fun.dat
EMBLPATH/inv.dat
EMBLPATH/mam.dat
EMBLPATH/org.dat
EMBLPATH/patent.dat
EMBLPATH/phg.dat
EMBLPATH/pln.dat
EMBLPATH/pri.dat
EMBLPATH/pro.dat
EMBLPATH/rod.dat
EMBLPATH/syn.dat
EMBLPATH/una.dat
EMBLPATH/vrl.dat
EMBLPATH/vrt.dat

Apart from the division lookup file, all these files are exactly as they
appear on the cdrom. The division lookup file relates numbers stored in the
indexes to actual division (or data) files stored on the disk. We rewrite it
so the directory structure and file names can be chosen locally. Its format is
I6,1x,A (a six-column integer, one space, then the file name). An example is
given below.

Division lookup file

File name: STADTABL/embl_div.lkp
Environment variable path: EMBLDIVPATH
Contents:

     1 EMBLPATH/bb.dat
     2 EMBLPATH/fun.dat
     3 EMBLPATH/inv.dat
     4 EMBLPATH/mam.dat
     5 EMBLPATH/org.dat
     6 EMBLPATH/patent.dat
     7 EMBLPATH/phg.dat
     8 EMBLPATH/pln.dat
     9 EMBLPATH/pri.dat
    10 EMBLPATH/pro.dat
    11 EMBLPATH/rod.dat
    12 EMBLPATH/syn.dat
    13 EMBLPATH/una.dat
    14 EMBLPATH/vrl.dat
    15 EMBLPATH/vrt.dat
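
The I6,1x,A layout is a fixed-column format: the division number is right
justified in columns 1 to 6, column 7 is skipped, and the rest of the line is
the file name. A minimal sketch in Python for reading and writing such a
file follows; the function names are invented for this example.

```python
def parse_division_lookup(text: str) -> dict[int, str]:
    """Read I6,1x,A records into {division number: file name}."""
    divisions = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        number = int(line[:6])    # I6: integer in columns 1-6
        name = line[7:].rstrip()  # 1x skips column 7, A takes the rest
        divisions[number] = name
    return divisions


def format_division_lookup(divisions: dict[int, str]) -> str:
    """Write {division number: file name} back out as I6,1x,A records."""
    return "".join(
        f"{number:6d} {name}\n" for number, name in sorted(divisions.items())
    )


text = format_division_lookup({2: "EMBLPATH/fun.dat", 1: "EMBLPATH/bb.dat"})
print(text, end="")
print(parse_division_lookup(text))
```

Reading by column rather than by splitting on white space mirrors the
format statement and keeps working if a file name contains spaces.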

---------------------------------------------------------------------------


Section III Options currently available
---------------------------------------

Facilities currently offered in nip, pip, sip, nipl, pipl, sipl:

   Get a sequence by knowing its entry name
   Get a sequence's annotation by knowing its entry name
   Get an entry name by knowing its accession number
   Search the freetext index
   Search the author index

Facilities currently offered in nipl, pipl, sipl:

   Search whole library
   Search only a list of entry names
   Search all but a list of entry names

Outline of each type of operation

Looking for an entry by name: the programs open the library description
file and read the names of its files and their file types. Then they open
the entrynam.idx file and find the sequence offset, annotation offset and
division number. Then they open the division lookup file, find the file name
for the division required, open that file, seek to the required byte and get
the data.
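
The steps above can be sketched as follows (Python). This is an illustration
only: the index is modelled as an in-memory list sorted by entry name rather
than the binary entrynam.idx file, whose record layout is not described in
these notes, and the data file is a toy with made-up entries AAAA and BBBB.
The only format fact relied on is that an entry in an EMBL data file ends
with a line starting with //.

```python
import bisect
import os
import tempfile
from typing import NamedTuple, Optional


class IndexRecord(NamedTuple):
    name: str
    annotation_offset: int
    sequence_offset: int
    division: int


def find_entry(index: list[IndexRecord], name: str) -> Optional[IndexRecord]:
    """Binary search of an index that is sorted by entry name."""
    # Tuples compare element by element, so (name,) sorts immediately
    # before any record whose first field is that name.
    i = bisect.bisect_left(index, (name,))
    if i < len(index) and index[i].name == name:
        return index[i]
    return None


def read_from(path: str, offset: int) -> str:
    """Seek to a byte offset and read up to the '//' entry terminator."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for raw in handle:
            if raw.startswith(b"//"):
                break
            lines.append(raw.decode("ascii"))
    return "".join(lines)


def demo() -> str:
    data = b"ID   AAAA\nSQ\n    acgt\n//\nID   BBBB\nSQ\n    ttga\n//\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bb.dat")
        with open(path, "wb") as handle:
            handle.write(data)
        divisions = {1: path}  # what the division lookup file provides
        index = [
            IndexRecord("AAAA", 0, data.index(b"SQ"), 1),
            IndexRecord("BBBB", data.index(b"ID   BBBB"), data.rindex(b"SQ"), 1),
        ]
        record = find_entry(index, "BBBB")
        return read_from(divisions[record.division], record.sequence_offset)


print(demo())
```

Because the index is sorted, finding a name needs only a handful of
comparisons, and the seek means none of the preceding data is read.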

Looking for an entry by accession number: the programs open the library
description file and read the names of its files and their file types. Then
they open the acnum.trg and acnum.hit files. The acnum.trg file is read to find
the accession number, a pointer into the acnum.hit file and the number of
hits. The acnum.hit file is then read and the corresponding entry names are
displayed. At present no further action is performed, although I expect to
list out the titles for the entries found.
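
The target/hit arrangement is a two level index: one sorted table of keys,
each pointing at a run of records in a second table. A rough sketch in
Python is given below, using in-memory lists in place of the acnum.trg and
acnum.hit files. It ASSUMES that each hit is a record number in the entry
name index; the accession numbers and entry names shown are invented.

```python
import bisect
from typing import NamedTuple


class Target(NamedTuple):
    key: str        # accession number
    first_hit: int  # position of the first hit in the hits table
    nhits: int      # number of consecutive hits for this key


def entries_for_accession(targets, hits, entry_names, accession):
    """Return the entry names recorded for one accession number."""
    # targets is sorted by key, so a binary search finds the record.
    i = bisect.bisect_left(targets, (accession,))
    if i == len(targets) or targets[i].key != accession:
        return []
    target = targets[i]
    run = hits[target.first_hit : target.first_hit + target.nhits]
    return [entry_names[record_number] for record_number in run]


# Made-up data: three entries, two accession numbers.
ENTRY_NAMES = ["AAAA", "BBBB", "CCCC"]
TARGETS = [Target("X00001", 0, 1), Target("X00002", 1, 2)]
HITS = [0, 1, 2]

print(entries_for_accession(TARGETS, HITS, ENTRY_NAMES, "X00002"))
```

The same arrangement would serve for the freetext and author indexes, which
also come as target and hit files.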

Searching the whole of a library: the programs open the library
description file and read the names of its files and their file types. Then
they open the division lookup file, read the names and numbers of the sequence
files, open all of them, then open the entry name file. Then the library is
processed sequentially by reading the entry names, their sequence offsets and
division numbers from the entry name file, and then the sequence from the
appropriate data file.

Searching the whole of a library using a list of entry names to include: the
programs open the library description file and read the names of its files
and their file types. Then they open the division lookup file, read the names
and numbers of the sequence files, open all of them, then open the entry name
file. Then the library is processed by reading the list of entry names and
finding the names in the entry name file to get their sequence offsets and
division numbers, and then the sequence from the appropriate data file. The
search stops when it reaches the end of the list of entry names. The list of
entry names can be in any order.

Searching the whole of a library using a list of entry names to exclude: the
programs open the library description file and read the names of its files
and their file types. Then they open the division lookup file, read the names
and numbers of the sequence files, open all of them, then open the entry name
file. Then the library is processed sequentially: the first name on the list
of entry names is read, and each entry in the entry name file is compared with
it. If the entry does not match, its sequence offset and division number are
used to get the sequence from the appropriate data file. If the entry matches
the name on the list, it is skipped and the next name to exclude is read. When
the list of excluded names is finished the rest of the library is searched
sequentially. The list of entry names must be in the same order as those in the
library (i.e. sorted alphabetically).
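
The exclusion search is a single pass over two sorted lists, which is why
the order of the list matters. The sketch below (Python, with an invented
function name and made-up entry names) holds one name to exclude at a time,
as described above, and shows what goes wrong when the list is not in
library order.

```python
def search_excluding(library_names, exclude_names):
    """Yield the library entry names that are not on the exclude list.

    Both inputs must be in the same (alphabetical) order.
    """
    excluded = iter(exclude_names)
    next_excluded = next(excluded, None)  # None once the list is finished
    for name in library_names:
        if name == next_excluded:
            # Skip this entry and read the next name to exclude.
            next_excluded = next(excluded, None)
            continue
        yield name


LIBRARY = ["AAAA", "BBBB", "CCCC", "DDDD"]

print(list(search_excluding(LIBRARY, ["BBBB", "DDDD"])))
print(list(search_excluding(LIBRARY, ["DDDD", "BBBB"])))  # wrong order
```

In the second call BBBB is not excluded: by the time it becomes the current
name to exclude, the pass through the library has already gone beyond it.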

Searching a whole library using a PIR format file is performed by reading it
sequentially. If a list of entry names is used it must be in the same order as
the entries in the library file.

---------------------------------------------------------------------------


Section IV Installation guide
-----------------------------

EMBL CDROM

The data can be left on the cdrom or copied to hard disk. The files
staden.login and staden.profile source the files
$STADTABL/libraries.config.csh and $STADTABL/libraries.config.sh
respectively. Refer to these files to see what is required to install, add or
move a sequence library that you want to be used by the programs.

Other libraries (PIR, GenBank, EMBL updates)

Create the indexes, then edit the files that tell the programs where the data
is stored. As for the EMBL cdrom, the files staden.login and staden.profile
source $STADTABL/libraries.config.csh and $STADTABL/libraries.config.sh
respectively. Refer to these files to see what is required to install, add or
move a sequence library that you want to be used by the programs.


------------------------------------------------------------------------------


Section V New feature table handling facilities
-----------------------------------------------

As mentioned above, EMBL and GenBank have recently introduced new feature
tables for annotating the sequences. They are a great improvement on the
previous ones and, among other things, now permit correct translation of
spliced genes. Various options within nip have been added or modified to take
advantage of these changes. The routine to translate DNA to protein and write
the protein to disk now gives correct results for spliced genes. The routine
to translate DNA to protein and display the two together now gives correct
translations except for the amino acids spanning intron/exon junctions. The
routine to plot maps from feature tables can use the new style. The open
reading frame finding routine writes out its results in the new style. The
routine that finds open reading frames and writes their translations to disk
also writes a title in the form of a new style feature table entry. The
feature table format output from the pattern searches in nip also uses the new
style.



----------------------------------------------------------------------------

Section VI Indexing the sequence libraries
--------------------------------------------

We handle EMBL, SwissProt, and GenBank in their distributed format, plus
PIR and NRL3D in CODATA format. All programs and scripts are in directory
indexseqlibs.

Currently we produce an entry name index, an accession number index, a
freetext index and a brief index (the brief index contains the entry name,
the primary accession number, the sequence length and an 80 character
description).

Producing any of the indexes requires the creation of several intermediate
files, and the indexing programs are written so that the intermediate files
are the same for all libraries. This means that only the programs that read
the distributed form of each library need to be unique to that library, and
all the other processing programs can be used for all libraries.
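
The division of labour can be pictured as below (Python). This is a sketch of
the idea only, not a description of what emblentryname1, genbentryname1 or
entryname2 actually do: the intermediate record (name, division, offset) is
an assumption made for the example, as are the function names. It relies on
two facts about the distributed formats: an EMBL entry starts with an ID
line and a GenBank entry starts with a LOCUS line.

```python
import io


def scan_entries(handle, prefix, division):
    """Yield (entry name, division, byte offset) for each entry start line."""
    offset = 0
    for raw in handle:
        if raw.startswith(prefix):
            name = raw.split()[1].decode("ascii")
            yield name, division, offset
        offset += len(raw)


def embl_stage1(handle, division):
    """Library specific reader for EMBL (and Swissprot) style files."""
    return scan_entries(handle, b"ID ", division)


def genbank_stage1(handle, division):
    """Library specific reader for GenBank style files."""
    return scan_entries(handle, b"LOCUS ", division)


def stage2(records):
    """Common step: sort the intermediate records by entry name."""
    return sorted(records)


# Toy data; real entries carry many more lines and fields.
EMBL_DATA = b"ID   ZZ1 standard\nSQ\n//\nID   AA1 standard\nSQ\n//\n"
GENBANK_DATA = b"LOCUS       MM1  10 bp\nORIGIN\n//\n"

print(stage2(embl_stage1(io.BytesIO(EMBL_DATA), 1)))
print(stage2(genbank_stage1(io.BytesIO(GENBANK_DATA), 1)))
```

Only the first stage knows what an entry looks like in a given library; the
sorting step sees the same records whichever library they came from.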


However, even though the indexes have the same format, programs (like nip)
that read the libraries need to treat each library separately because their
actual contents are written differently.

Making the entry name index
---------------------------

Common program      entryname2

EMBL                emblentryname1
SwissProt           emblentryname1

GenBank             genbentryname1

PIR                 pirentryname1
NRL3D               pirentryname1


Making the accession number index
---------------------------------

Common programs     access2 access3 access4

EMBL                emblaccess1
SwissProt           emblaccess1

GenBank             genbaccess1

PIR                 piraccess1 piraccess2
NRL3D               No accession numbers

Making the brief index
----------------------

Common program      title2

EMBL                embltitle1
SwissProt           embltitle1

GenBank             genbtitle1

PIR                 pirtitle1 pirtitle2 (pir3 has no accession numbers)
NRL3D               pirtitle2

Scripts
-------

emblentryname.script
emblaccession.script
embltitle.script

swissentryname.script
swissaccession.script
swisstitle.script

genbentryname.script
genbaccession.script
genbtitle.script

pirentryname.script
piraccession.script
pirtitle.script

nrl3dentryname.script
nrl3dtitle.script