staden-lg/src/indexseqlibs
2023-04-14 02:09:01 +08:00
access2.c init 2021-12-04 05:07:58 +00:00
access4.c init 2021-12-04 05:07:58 +00:00
addnl.c init 2021-12-04 05:07:58 +00:00
author.c init 2021-12-04 05:07:58 +00:00
cdromheader.c init 2021-12-04 05:07:58 +00:00
cdromheader.h init 2021-12-04 05:07:58 +00:00
CHANGES init 2021-12-04 05:07:58 +00:00
data-flow.doc init 2021-12-04 05:07:58 +00:00
division.c init 2021-12-04 05:07:58 +00:00
emblaccess1.c init 2021-12-04 05:07:58 +00:00
emblaccession.script init 2021-12-04 05:07:58 +00:00
emblauthor.script init 2021-12-04 05:07:58 +00:00
embldivision.script init 2021-12-04 05:07:58 +00:00
emblentryname.script init 2021-12-04 05:07:58 +00:00
emblentryname1.c init 2021-12-04 05:07:58 +00:00
emblfreetext.script init 2021-12-04 05:07:58 +00:00
embltitle.script init 2021-12-04 05:07:58 +00:00
embltitle1.c init 2021-12-04 05:07:58 +00:00
embluaccession.script init 2021-12-04 05:07:58 +00:00
embluauthor.script init 2021-12-04 05:07:58 +00:00
embludivision.script init 2021-12-04 05:07:58 +00:00
embluentryname.script init 2021-12-04 05:07:58 +00:00
emblufreetext.script init 2021-12-04 05:07:58 +00:00
emblutitle.script init 2021-12-04 05:07:58 +00:00
entryname2.c init 2021-12-04 05:07:58 +00:00
excludewords.c fix: compile all 2023-04-14 02:02:58 +08:00
freetext.c init 2021-12-04 05:07:58 +00:00
freetext2.c init 2021-12-04 05:07:58 +00:00
freetext4.c init 2021-12-04 05:07:58 +00:00
genbaccess1.c init 2021-12-04 05:07:58 +00:00
genbaccession.script init 2021-12-04 05:07:58 +00:00
genbauthor.script init 2021-12-04 05:07:58 +00:00
genbdivision.script init 2021-12-04 05:07:58 +00:00
genbentryname.script init 2021-12-04 05:07:58 +00:00
genbentryname1.c init 2021-12-04 05:07:58 +00:00
genbfreetext.script init 2021-12-04 05:07:58 +00:00
genbtitle.script init 2021-12-04 05:07:58 +00:00
genbtitle1.c init 2021-12-04 05:07:58 +00:00
getEMBLupdates.script init 2021-12-04 05:07:58 +00:00
getstopwords.script init 2021-12-04 05:07:58 +00:00
hitNtrg.c init 2021-12-04 05:07:58 +00:00
mach-io.c init 2021-12-04 05:07:58 +00:00
mach-io.h init 2021-12-04 05:07:58 +00:00
makefile use default makefile and recovery previous alpha makefile 2023-04-14 02:09:01 +08:00
makefile-alpha use default makefile and recovery previous alpha makefile 2023-04-14 02:09:01 +08:00
makefile-dec init 2021-12-04 05:07:58 +00:00
makefile-sgi init 2021-12-04 05:07:58 +00:00
makefile-solaris init 2021-12-04 05:07:58 +00:00
makefile-sun init 2021-12-04 05:07:58 +00:00
makenrl3d.script init 2021-12-04 05:07:58 +00:00
piraccess1.c init 2021-12-04 05:07:58 +00:00
piraccess2.c init 2021-12-04 05:07:58 +00:00
piraccession.script init 2021-12-04 05:07:58 +00:00
pirauthor.script init 2021-12-04 05:07:58 +00:00
pirdivision.script init 2021-12-04 05:07:58 +00:00
pirentryname.script init 2021-12-04 05:07:58 +00:00
pirentryname1.c init 2021-12-04 05:07:58 +00:00
pirfreetext.script init 2021-12-04 05:07:58 +00:00
pirtitle.script init 2021-12-04 05:07:58 +00:00
pirtitle1.c init 2021-12-04 05:07:58 +00:00
pirtitle2.c init 2021-12-04 05:07:58 +00:00
README init 2021-12-04 05:07:58 +00:00
stopwords init 2021-12-04 05:07:58 +00:00
swissaccession.script init 2021-12-04 05:07:58 +00:00
swissauthor.script init 2021-12-04 05:07:58 +00:00
swissdivision.script init 2021-12-04 05:07:58 +00:00
swissentryname.script init 2021-12-04 05:07:58 +00:00
swissfreetext.script init 2021-12-04 05:07:58 +00:00
swisstitle.script init 2021-12-04 05:07:58 +00:00
title2.c init 2021-12-04 05:07:58 +00:00

Notes on indexing the sequence libraries
========================================

We handle EMBL, SwissProt, GenBank, PIR (in CODATA form) and NRL3D.

Currently we produce four indexes: an entry name index, an accession
number index, a brief index (containing the entry name, the primary
accession number, the sequence length and an 80 character description)
and a freetext index. For NRL3D we produce only the entry name and brief
indexes.
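
As an illustration of the brief index contents listed above, here is a
minimal sketch of one record held in memory. Only the four fields and the
80 character description width come from these notes. The struct name, the
helper names and the widths chosen for the entry name and accession number
are assumptions made for the example, and the layout is not claimed to
match what title2 actually writes.

```c
#include <assert.h>
#include <string.h>

#define BRIEF_NAME_LEN 10   /* assumed width, not stated in these notes */
#define BRIEF_ACC_LEN  10   /* assumed width, not stated in these notes */
#define BRIEF_DESC_LEN 80   /* from the notes: 80 character description */

/* Hypothetical in-memory view of one brief index record. */
typedef struct {
    char name[BRIEF_NAME_LEN + 1];         /* entry name */
    char accession[BRIEF_ACC_LEN + 1];     /* primary accession number */
    long seq_length;                       /* sequence length */
    char description[BRIEF_DESC_LEN + 1];  /* one-line description */
} BriefRecord;

/* Copy src into a fixed-width field: truncate if too long, pad with
 * blanks if too short. dst must have room for width + 1 bytes. */
static void fill_field(char *dst, const char *src, size_t width)
{
    size_t n = strlen(src);
    if (n > width)
        n = width;
    memcpy(dst, src, n);
    memset(dst + n, ' ', width - n);
    dst[width] = '\0';
}

/* Build a record from its four parts. */
BriefRecord make_brief(const char *name, const char *acc,
                       long seq_length, const char *desc)
{
    BriefRecord r;
    assert(name != NULL && acc != NULL && desc != NULL);
    fill_field(r.name, name, BRIEF_NAME_LEN);
    fill_field(r.accession, acc, BRIEF_ACC_LEN);
    r.seq_length = seq_length;
    fill_field(r.description, desc, BRIEF_DESC_LEN);
    return r;
}
```

Fixed-width, blank-padded fields are used here because they make every
record the same size. That is a design choice for the sketch only.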


Naturally the libraries are all in different formats.

Producing any of the indexes requires creating several intermediate
files, and the indexing programs are written so that the intermediate files
are the same for all libraries. This means that only the programs that read
the distributed form of each library need to be unique to that library, and
all the other processing programs can be used for all libraries.


However, even though the indexes have the same format, programs (like nip)
that read the libraries need to treat each library separately because their
actual contents are written differently.


With the exception of the freetext index creation script, all the
procedures run quite quickly.

Making the entry name index
---------------------------

Common program entryname2

EMBL		emblentryname1
SwissProt	emblentryname1

GenBank		genbentryname1

PIR		pirentryname1
NRL3D		pirentryname1
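
The split between library-specific readers and the common program can be
written down as a small lookup table. The sketch below encodes the entry
name table above. The program names are the ones listed. The struct, the
array and the function entryname_reader are hypothetical names introduced
for the example and are not part of the indexing programs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the table: a library and the program that reads its
 * distributed form for the entry name index. */
typedef struct {
    const char *library;
    const char *reader;
} EntrynameStep;

static const EntrynameStep entryname_steps[] = {
    { "EMBL",      "emblentryname1" },
    { "SwissProt", "emblentryname1" },
    { "GenBank",   "genbentryname1" },
    { "PIR",       "pirentryname1"  },
    { "NRL3D",     "pirentryname1"  },
};

/* The program shared by all libraries. */
static const char entryname_common[] = "entryname2";

/* Return the library-specific reader for a library,
 * or NULL if the library is not in the table. */
const char *entryname_reader(const char *library)
{
    size_t i;
    size_t n = sizeof entryname_steps / sizeof entryname_steps[0];

    assert(library != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(entryname_steps[i].library, library) == 0)
            return entryname_steps[i].reader;
    return NULL;
}
```

Note that SwissProt shares the EMBL reader and NRL3D shares the PIR
reader, so five libraries need only three library-specific programs.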


Making the accession number index
---------------------------------

Common programs access2 access4

EMBL		emblaccess1
SwissProt	emblaccess1

GenBank		genbaccess1

PIR		piraccess1 piraccess2 
NRL3D		No accession numbers

Making the brief index
----------------------

Common program title2

EMBL		embltitle1
SwissProt	embltitle1

GenBank		genbtitle1

PIR		pirtitle1 pirtitle2 (pir3 has no accession numbers)
NRL3D		pirtitle2

Making the freetext index
-------------------------

Common programs freetext2 freetext4

EMBL		emblfreetext
SwissProt	emblfreetext

GenBank		genbfreetext

PIR		pirfreetext
NRL3D		not done

Note that the file stopwords is required.
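
These notes do not say how the stopwords file is used. Assuming it lists
common words that are to be left out of the freetext index, the filtering
step could look like the sketch below. The function names, the sorted
in-memory list and the exact, case-sensitive comparison are all
assumptions made for the example. The real programs may work differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* bsearch comparator: key is a word, elem points at a list entry. */
static int cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char * const *)elem);
}

/* Return 1 if word is in the stop list, else 0.
 * The list must be sorted in strcmp order. */
int is_stopword(const char *word, const char * const *list, size_t n)
{
    assert(word != NULL && list != NULL);
    return bsearch(word, list, n, sizeof list[0], cmp_word) != NULL;
}

/* Copy the words that are not stop words from in[] to out[],
 * keeping their order. Returns the number of words kept.
 * out[] must have room for n_in pointers. */
size_t drop_stopwords(const char * const *in, size_t n_in,
                      const char * const *stop, size_t n_stop,
                      const char **out)
{
    size_t i, kept = 0;

    for (i = 0; i < n_in; i++)
        if (!is_stopword(in[i], stop, n_stop))
            out[kept++] = in[i];
    return kept;
}
```

A binary search over a sorted list is used here only because it is short
and needs nothing beyond the standard C library.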

Scripts
-------

emblentryname.script
emblaccession.script
embltitle.script
emblfreetext.script

swissentryname.script
swissaccession.script
swisstitle.script
swissfreetext.script


genbentryname.script
genbaccession.script
genbtitle.script
genbfreetext.script

pirentryname.script
piraccession.script
pirtitle.script
pirfreetext.script

makenrl3d.script