Notes on library handling
-------------------------
Contents of this document:
I) Introduction
II) Details of file organisation and use
III) Options currently available
IV) Installation guide
V) New feature table handling facilities
VI) Indexing the sequence libraries
Section I Introduction
----------------------
Available sequence libraries
There are a number of different sequence libraries for nucleotide and protein:
PIR, GenBank, EMBL, Swissprot, and the Japanese Databank. Even after all the
years of their existence they still use different formats for their data. This
provides tedious and unrewarding work for software developers. Recently EMBL
and GenBank agreed a new and common way of writing their feature tables, which
is a great help, although the rest of their formats still differ. Swissprot
still uses the old EMBL-style feature table format and PIR yet another.
All the libraries distribute their data on magnetic tapes and EMBL and GenBank
have started to distribute on cdrom. The EMBL cdrom also contains Swissprot.
The GenBank and EMBL cdroms use different formats and have different contents.
The EMBL cdrom has useful indexes sorted alphabetically: those for entry name
and accession number, brief descriptions, keywords and freetext indexes are
already available and others are expected. These indexes point to the data for
each entry, and can be used to extract the data for any entry quickly.
Moving to unix
The VAX version of our package used PIR format which meant reformatting all
libraries other than PIR into that format. This required, at least
temporarily, having space for two copies of the libraries, and quite a lot of
cpu time. The software for doing this was provided by PIR, and is very VAX
specific and hence will not run under unix. For the unix version of our package
I have decided to use the EMBL cdrom format and its indexes as the primary
format. The current programs also support the use of PIR format libraries
without indexes - ie just the sequence and annotation files.
Indexing GenBank, EMBL updates, PIR and NRL3D
We include programs to create indexes for the above libraries. See below and
the README file in indexseqlibs. The programs can read all the above libraries
once the indexes are created. The indexing programs index the data in its
distributed form: WE DO NOT REFORMAT OR COPY THE LIBRARIES but simply create
indexes to the original files. Obviously this saves a lot of disk space, and
for those content to use only embl and swissprot from the cdrom, almost no disk
space is required. We have not tried it yet, but for genbank on cdrom, the only
extra disk space required would be for the indexes.
---------------------------------------------------------------------------
Section II Details of file organisation and use
-----------------------------------------------
The following strategy has been used to try to deal with alternate
and changing sequence library formats.
1) libraries are described at several levels:
   a) the top level file is a list of available libraries which contains:
      the library type, the name of the file containing the names of
      each library's individual files, and the prompt to appear on
      the user's screen: LTYPE LOGNAM PROMPT
   b) the file containing the names of the library's individual files
      contains flags to define the file types: FTYPE LOGNAM
   c) the individual library files
2) library types handled:
   a) EMBL/SWISSPROT in distributed format with cdrom index format
      LTYPE = 'A'
   b) GenBank in distributed format with cdrom index format LTYPE = 'C'
   c) PIR/NRL3D in CODATA format with cdrom index format LTYPE = 'B'
   d) PIR/NBRF .seq files can be read sequentially as "personal files
      in PIR format" and do not appear in the list of available libraries.
   e) FASTA format files can be read sequentially as "personal files
      in FASTA format" and do not appear in the list of available
      libraries.
3) EMBL, SWISSPROT and other libraries for which EMBL-style indexes have been
   created
   current file types:
      A division.lookup
      B entryname.index
      C accession.target
      D accession.hits
      E brief description
      F freetext.target
      G freetext.hits
      H author.target
      I author.hits
level 1                           Library list
                                        |
                ------------------------------------------------- 
                |                       |                       |
level 2  lib 1 file list         lib 2 file list         lib 3 file list
                |                       |
            --------                ---------
level 3      file 1                  file 1
             file 2                  file 2
                .                       .
             file n                  file n
---------------------------------------------------------------------------
Example
-------
Level 1
File name: sequence.libs
Environment variable: SEQUENCELIBRARIES
Contents:
A EMBLFILES EMBL nucleotide library ! in cdrom format
C GENBFILES GenBank nucleotide library!
A SWISSFILES SWISSPROT protein library! in cdrom format
B PIRFILES PIR protein library!
B NRL3DFILES NRL3D protein library!
Notes:
The libraries have types A,B,C. The logical names are EMBLFILES and
SWISSFILES, etc and the prompts are 'EMBL nucleotide library' and
'SWISSPROT protein library', etc. Anything to the right of a ! is a comment.
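The level 1 file is simple enough to read with a few lines of code. The
following Python sketch is illustrative only and is not taken from the package:
it assumes that the three fields are separated by white space and that the
prompt is everything between the logical name and any ! comment. The names
LibraryEntry and parse_library_list are hypothetical.

```python
from typing import NamedTuple


class LibraryEntry(NamedTuple):
    ltype: str   # library type flag: 'A', 'B' or 'C'
    lognam: str  # logical (environment variable) name of the level 2 file
    prompt: str  # text shown to the user


def parse_library_list(text):
    """Parse LTYPE LOGNAM PROMPT lines; text after '!' is a comment."""
    entries = []
    for line in text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop comment and padding
        if not line:
            continue
        # Assumption: white space separates the first two fields and the
        # remainder of the line is the prompt.
        ltype, lognam, prompt = line.split(None, 2)
        entries.append(LibraryEntry(ltype, lognam, prompt))
    return entries


example = """\
A EMBLFILES EMBL nucleotide library ! in cdrom format
C GENBFILES GenBank nucleotide library!
"""
for entry in parse_library_list(example):
    print(entry)
```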
Level 2: the list of library files (using embl as an example)
File name: embl.files
Environment variable: EMBLFILES
Contents:
A EMBLDIVPATH/embl_div.lkp
B EMBLINDPATH/entrynam.idx
C EMBLINDPATH/acnum.trg
D EMBLINDPATH/acnum.hit
E EMBLINDPATH/brief.idx
F EMBLINDPATH/freetext.trg
G EMBLINDPATH/freetext.hit
H EMBLINDPATH/author.trg
I EMBLINDPATH/author.hit
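As an illustration, a level 2 file could be read as sketched below. This is
written under assumptions and is not the package's own reader: it assumes that
the first component of each path (EMBLDIVPATH, EMBLINDPATH) is the name of an
environment variable holding a directory. The function name parse_file_list
and the directories used in the example are invented.

```python
import os


def parse_file_list(text, env=None):
    """Map each FTYPE flag to a path, expanding a leading variable name."""
    env = os.environ if env is None else env
    files = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ftype, path = line.split(None, 1)
        head, sep, tail = path.strip().partition("/")
        # Assumption: the first path component names an environment
        # variable; if it is not set, the name is left unchanged.
        files[ftype] = env.get(head, head) + sep + tail
    return files


example = """\
A EMBLDIVPATH/embl_div.lkp
B EMBLINDPATH/entrynam.idx
"""
# Hypothetical directories, passed in explicitly instead of via os.environ.
paths = parse_file_list(
    example, env={"EMBLDIVPATH": "/tables", "EMBLINDPATH": "/cdrom/indices"}
)
print(paths["B"])
```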
Level 3: the sequence and annotation files (eg 15 for embl, 1 for swissprot).
Paths and file names:
EMBLPATH/bb.dat
EMBLPATH/fun.dat
EMBLPATH/inv.dat
EMBLPATH/mam.dat
EMBLPATH/org.dat
EMBLPATH/patent.dat
EMBLPATH/phg.dat
EMBLPATH/pln.dat
EMBLPATH/pri.dat
EMBLPATH/pro.dat
EMBLPATH/rod.dat
EMBLPATH/syn.dat
EMBLPATH/una.dat
EMBLPATH/vrl.dat
EMBLPATH/vrt.dat
All files below the division lookup file are exactly as they appear on the
cdrom. The division lookup file relates the numbers stored in the indexes to
the actual division (or data) files stored on the disk. We rewrite it so that
the directory structure and file names can be chosen locally. Its format is
I6,1x,A (in Fortran notation: an integer in six columns, one blank, then the
file name). An example is given below.
Division lookup file
File name: STADTABL/embl_div.lkp
Environment variable path EMBLDIVPATH
Contents:
     1 EMBLPATH/bb.dat
     2 EMBLPATH/fun.dat
     3 EMBLPATH/inv.dat
     4 EMBLPATH/mam.dat
     5 EMBLPATH/org.dat
     6 EMBLPATH/patent.dat
     7 EMBLPATH/phg.dat
     8 EMBLPATH/pln.dat
     9 EMBLPATH/pri.dat
    10 EMBLPATH/pro.dat
    11 EMBLPATH/rod.dat
    12 EMBLPATH/syn.dat
    13 EMBLPATH/una.dat
    14 EMBLPATH/vrl.dat
    15 EMBLPATH/vrt.dat
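The I6,1x,A layout can be read and written as sketched below in Python. The
function names are hypothetical; the only thing taken from the text is the
record layout: an integer right-justified in columns 1 to 6, one blank, then
the file name.

```python
def parse_division_lookup(text):
    """Read records laid out as Fortran format I6,1x,A."""
    divisions = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        number = int(line[:6])    # I6: integer in columns 1-6
        path = line[7:].rstrip()  # 1x skips column 7; A is the rest
        divisions[number] = path
    return divisions


def format_division_lookup(divisions):
    """Write the same layout back out, one division per line."""
    return "".join(f"{n:6d} {p}\n" for n, p in sorted(divisions.items()))


table = format_division_lookup({1: "EMBLPATH/bb.dat", 15: "EMBLPATH/vrt.dat"})
print(table, end="")
```

Rewriting the lookup file for a local directory layout then amounts to
changing the path on the right of each record and leaving the numbers alone.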
---------------------------------------------------------------------------
Section III Options currently available
---------------------------------------
Facilities currently offered in nip,pip,sip,nipl,pipl,sipl:
   Get a sequence by knowing its entry name
   Get a sequence's annotation by knowing its entry name
   Get an entry name by knowing its accession number
   Search the freetext index
   Search the author index
Facilities currently offered in nipl,pipl,sipl:
   Search whole library
   Search only a list of entry names
   Search all but a list of entry names
Outline of each type of operation
Looking for an entry by name: the programs will open the library description
file and read the names of its files and their file types. Then they will open
the entrynam.idx file, and find the sequence offset, annotation offset and
division number. Then open the division lookup file, find the file name for the
division required, open that file, seek to the required byte and get the data.
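The lookup by entry name can be sketched as follows. The record layout of
entrynam.idx is not described in these notes, so this Python sketch uses an
in-memory list sorted by entry name in its place. IndexRecord, find_entry and
read_at are hypothetical names, and the toy division file and its offsets are
invented for the example.

```python
import bisect
import os
import tempfile
from typing import NamedTuple


class IndexRecord(NamedTuple):
    name: str        # entry name; the index is sorted on this field
    ann_offset: int  # byte offset of the annotation in the division file
    seq_offset: int  # byte offset of the sequence in the division file
    division: int    # number resolved through the division lookup file


def find_entry(index, name):
    """Binary search a name-sorted list of IndexRecord; None if absent."""
    # A one-element tuple sorts just before any record with the same name.
    i = bisect.bisect_left(index, (name,))
    if i < len(index) and index[i].name == name:
        return index[i]
    return None


def read_at(path, offset, nbytes):
    """Seek to a byte offset in a division file and read from there."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(nbytes)


# Toy division file standing in for one of the .dat files.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(b"ID   AAAA\nSQ\nacgt\n//\nID   BBBB\nSQ\nttga\n//\n")
divisions = {1: tmp.name}  # stand-in for the division lookup file
index = [IndexRecord("AAAA", 0, 13, 1), IndexRecord("BBBB", 21, 34, 1)]

rec = find_entry(index, "BBBB")
sequence = read_at(divisions[rec.division], rec.seq_offset, 4)
os.unlink(tmp.name)
```

Because the index is sorted and the data files are addressed by byte offset,
one entry can be fetched without reading the rest of the library.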
Looking for an entry by accession number: the programs will open the library
description file and read the names of its files and their file types. Then
they open the acnum.trg and acnum.hit files. The acnum.trg file is read to find
the accession number and a pointer to the acnum.hit file and the number of
hits. That file is read and the corresponding entry names displayed. At
present no further action is performed, although I expect to list out the
titles for the entries found.
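The target and hit files work as a pair, which the sketch below tries to
picture in Python. It is not the package's code: the lists stand in for the
files, the accession numbers and entry names are made up, and it assumes that
each record in the hit file identifies a record in the entry name index. The
names Target and entries_for_accession are hypothetical.

```python
import bisect
from typing import NamedTuple


class Target(NamedTuple):
    key: str        # accession number; targets are sorted on this field
    nhits: int      # how many consecutive hit records belong to this key
    first_hit: int  # position of the first of those records in the hits


def entries_for_accession(targets, hits, entry_names, accession):
    """Return the entry names listed under one accession number."""
    i = bisect.bisect_left(targets, (accession,))
    if i == len(targets) or targets[i].key != accession:
        return []
    t = targets[i]
    # Assumption: each hit is the position of a record in the entry
    # name index.
    return [entry_names[h] for h in hits[t.first_hit:t.first_hit + t.nhits]]


entry_names = ["ENTRYA", "ENTRYB", "ENTRYC"]  # stand-in for entrynam.idx
targets = [Target("X00001", 1, 0), Target("X00002", 2, 1)]  # acnum.trg
hits = [0, 1, 2]                              # stand-in for acnum.hit
found = entries_for_accession(targets, hits, entry_names, "X00002")
print(found)
```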
Searching the whole of a library: the programs will open the library
description file and read the names of its files and their file types. Then
they open the division lookup file, read the names and numbers of the sequence
files, open all of them, then open the entryname file. Then the library is
processed sequentially by reading the entry names, their sequence offsets and
division numbers from the entry names file, and then the sequence from the
appropriate data file.
Searching the whole of a library using a list of entry names to include: the
programs will open the library description file and read the names of its files
and their file types. Then they open the division lookup file, read the names
and numbers of the sequence files, open all of them, then open the entryname
file. Then the library is processed by reading the list of entry names and
finding the names in the entry names file to get their sequence offsets and
division numbers, and then the sequence from the appropriate data file. It will
stop when it reaches the end of the list of entry names. The list of entry
names can be in any order.
Searching the whole of a library using a list of entry names to exclude: the
programs will open the library description file and read the names of its files
and their file types. Then they open the division lookup file, read the names
and numbers of the sequence files, open all of them, then open the entryname
file. Then the library is processed sequentially: the first name on the list
of entry names is read, and each entry in the entry names file is compared
with it. If the entry does not match, its sequence offset and division number
are taken and the sequence is read from the appropriate data file. If the
next name matches the name on the list of entry names, it is skipped, and the
next name to exclude is read. Once the list of excluded names is finished the
rest of the library is searched sequentially. The list of entry names must be
in the same order as those in the library (ie sorted alphabetically).
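The exclusion pass is a single sequential walk over two lists in the same
order, which is why the order of the list of entry names matters. A minimal
Python sketch, with a hypothetical function name and single-letter names
standing in for entry names:

```python
def entries_to_search(library_names, exclude_names):
    """Yield library entry names that are not on the exclusion list.

    Both sequences must be in the same (library) order.
    """
    excluded = iter(exclude_names)
    pending = next(excluded, None)  # next name to skip; None when finished
    for name in library_names:
        if pending is not None and name == pending:
            pending = next(excluded, None)  # skip it, read the next exclusion
            continue
        yield name


library = ["A", "B", "C", "D"]
kept = list(entries_to_search(library, ["B", "D"]))
kept_unsorted = list(entries_to_search(library, ["D", "B"]))
```

With the exclusion list in library order (B, D) the entries A and C are
searched. With the same list given as D, B only D is skipped, because B has
already been passed by the time it becomes the next name to exclude.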
Searching a whole library using a PIR format file is performed by reading it
sequentially. If a list of entry names is used it must be in the same order as
the entries in the library file.
---------------------------------------------------------------------------
Section IV Installation guide
-----------------------------
EMBL CDROM
The data can be left on the cdrom or copied to hard disk. The files
staden.login and staden.profile source the files
$STADTABL/libraries.config.csh and $STADTABL/libraries.config.sh respectively.
Refer to these files to see what is required to install, add or move a
sequence library that you want to be used by the programs.
Other libraries (PIR, GenBank, EMBL updates)
Create the indexes then edit the files that tell the programs where the data is
stored. The files staden.login and staden.profile source the file
$STADTABL/libraries.config. Refer to this file to see what is required to
install, add or move a sequence library that you want to be used by the
programs.
------------------------------------------------------------------------------
Section V New feature table handling facilities
-----------------------------------------------
As mentioned above EMBL and GenBank have recently introduced new feature tables
for annotating the sequences. They are a great improvement on the previous ones
and, among other things, now permit correct translation of spliced genes.
Various options within nip have been added or modified to take advantage of
these changes. The routine to translate DNA to protein and write the protein
to disk now gives correct results for spliced genes. The routine to translate
DNA to protein and display the two together now gives correct translations
except for the amino acids spanning intron/exon junctions. The routine to plot
maps from feature tables can use the new style. The open reading frame finding
routine writes out its results in the new style. The routine that finds open
reading frames and writes their translations to disk also writes a title in the
form of a new style feature table entry. The feature table format output from
the pattern searches in nip also uses the new style.
----------------------------------------------------------------------------
Section VI Indexing the sequence libraries
--------------------------------------------
We handle EMBL, SwissProt, and GenBank in their distributed format, plus
PIR and NRL3D in codata format. All programs and scripts are in directory
indexseqlibs.
Currently we produce an entry name index, an accession number index, a freetext
index, and a brief index (the brief index contains the entry name, the primary
accession number, the sequence length and an 80 character description).
To produce any of the indexes requires the creation of several intermediate
files and the indexing programs are written so that the intermediate files
are the same for all libraries. This means that only the programs that read
the distributed form of each library need to be unique to that library, and
all the other processing programs can be used for all libraries.
However, even though the indexes have the same format, programs (like nip)
that read the libraries need to treat each library separately because their
actual contents are written differently.
Making the entry name index
---------------------------
Common program   entryname2
EMBL             emblentryname1
SwissProt        emblentryname1
GenBank          genbentryname1
PIR              pirentryname1
NRL3D            pirentryname1
Making the accession number index
---------------------------------
Common programs  access2 access3 access4
EMBL             emblaccess1
SwissProt        emblaccess1
GenBank          genbaccess1
PIR              piraccess1 piraccess2
NRL3D            No accession numbers
Making the brief index
----------------------
Common program   title2
EMBL             embltitle1
SwissProt        embltitle1
GenBank          genbtitle1
PIR              pirtitle1 pirtitle2 (pir3 has no accession numbers)
NRL3D            pirtitle2
Scripts
-------
emblentryname.script
emblaccession.script
embltitle.script
swissentryname.script
swissaccession.script
swisstitle.script
genbentryname.script
genbaccession.script
genbtitle.script
pirentryname.script
piraccession.script
pirtitle.script
nrl3dentryname.script
nrl3dtitle.script