staden-lg/help/splitp_help


        Preparing the PROSITE protein motif library  for  use  by
the Staden programs

        Introduction

        A library of protein motifs (in our terminology,  because
they  include  variable  gaps, some would be called patterns) has
recently become available from Amos Bairoch, Departement de
Biochimie Medicale, University of Geneva. Currently it contains 317
patterns/motifs and arrives on tape or CD-ROM in two files: a .dat
file  and  a  .doc  file. There is also a user documentation file
prosite.usr. Here I outline  what  is  required  to  prepare  the
PROSITE library for use by our programs.

        Three programs are provided for this: SPLITP1, SPLITP2,
and SPLITP3. Only SPLITP3 is essential (see "Running the
conversion programs").

        Outline of the PROSITE files

        A typical entry in the .dat file is shown below.

ID   2FE2S_FERREDOXIN; PATTERN.
AC   PS00197;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   2Fe-2S ferredoxins, iron-sulfur binding region signature.
PA   C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C.
NR   /RELEASE=14,15409;
NR   /TOTAL=69(69); /POSITIVE=63(63); /UNKNOWN=0(0); /FALSE_POS=6(6);
NR   /FALSE_NEG=5(5);
CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=1;
CC   /SITE=1,iron_sulfur; /SITE=5,iron_sulfur; /SITE=8,iron_sulfur;
DR   P15788, FER$APHHA , T; P00250, FER$APHSA , T; P00223, FER$ARCLA , T;
DR   P00227, FER$BRANA , T; P07838, FER$BRYMA , T; P13106, FER$BUMFI , T;
DR   P00247, FER$CHLFR , T; P07839, FER$CHLRE , T; P00222, FER$COLES , T;
DO   PDOC00175;
//

        Each entry has an  accession  number  (here  PS00197),  a
pattern definition (here C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C) and a
documentation  file  cross  reference  (here  PDOC00175).    This
pattern  means: C, gap of 1 or 2, any of STA, gap of 2, C, any of
STA, not P, C.
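
        As an illustration only (this is not how the Staden
programs represent patterns), the reading of the pattern given
above can be sketched by translating it into a regular expression.
The function name prosite_to_regex is hypothetical, and the sketch
handles only the elements seen in this entry: single residues,
x gaps, [..] sets, {..} exclusions and (n) or (n,m) repeat counts.

```python
import re

# One pattern element: a residue, x, [set] or {excluded set},
# optionally followed by a repeat count (n) or range (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PA line value such as 'C-x(1,2)-[STA]-{P}-C.' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = m.groups()
        if residue == "x":
            core = "."                          # gap: any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # "not" these residues
        else:
            core = residue                      # residue or [set]
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

regex = prosite_to_regex("C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C.")
print(regex)
```

A sequence fragment such as CASAGCSAC satisfies the resulting
expression, whereas CASAGCSPC does not, because P occupies the
"not P" position.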

        We need to convert all of these patterns into our pattern
definitions  (as  membership  of  a  set, with the appropriate gap
ranges)  and  write  each  into  a  separate  pattern  file  with
corresponding "membership of a set" weight matrices. Each pattern
file  is  named  accession_number.pat  (here  PS00197.PAT).   The
corresponding    matrix    files    are    accession_number.wtsa,
accession_number.wtsb, etc for  however  many  are  needed  (here
PS00197.WTSA  and  PS00197.WTSB):  two  are needed because of the
variable gap.
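
        The following sketch shows one way the count of matrix
files could arise. It assumes, and this is only an assumption, that
a pattern is cut into fixed-length segments at each variable gap
x(n,m), with one weight matrix per segment. The function names
are hypothetical and do not describe the internals of SPLITP3.

```python
import re

VARIABLE_GAP = re.compile(r"x\((\d+),(\d+)\)")

def split_at_variable_gaps(pattern):
    """Return (segments, gaps): fixed-length pieces and the ranges between them."""
    segments, gaps, current = [], [], []
    for element in pattern.rstrip(".").split("-"):
        m = VARIABLE_GAP.fullmatch(element)
        if m:
            # A variable gap closes the current fixed-length segment.
            segments.append("-".join(current))
            gaps.append((int(m.group(1)), int(m.group(2))))
            current = []
        else:
            current.append(element)
    segments.append("-".join(current))
    return segments, gaps

def matrix_file_names(accession, pattern):
    """Name one matrix file per segment: .WTSA, .WTSB, and so on."""
    segments, _ = split_at_variable_gaps(pattern)
    return [accession + ".WTS" + chr(ord("A") + i) for i in range(len(segments))]

print(matrix_file_names("PS00197", "C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C."))
```

Under that assumption the example entry splits into the segments
C and [STA]-x(2)-C-[STA]-{P}-C, separated by a gap range of 1 to 2,
which agrees with the two matrix files named above.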

        In addition we can optionally split  the  .dat  and  .doc
files  into  separate  files,  one  for  each  entry,  with names
accession_number.dat and accession_number.doc. Also we create  an
index for the library, prosite.lis, which gives a one-line
description of each pattern, and ends with the pattern file and
documentation file numbers. The start of the file is shown below.

N-glycosylation site.                                                00001,00001
Glycosaminoglycan attachment site.                                   00002,00002
Tyrosine sulfatation site.                                           00003,00003
cAMP- and cGMP-dependent protein kinase phosphorylation site.        00004,00004

So the name of the pattern file for Glycosaminoglycan attachment
site is PS00002.PAT, and the documentation file is PDOC00002.DOC.
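
        The mapping from an index line to file names can be
sketched as follows. The function name parse_index_line is
hypothetical, and the sketch assumes each line ends with the two
numbers separated by a comma, as in the extract above.

```python
def parse_index_line(line):
    """Split a prosite.lis line into its description and the two file names."""
    # The numbers are the last whitespace-separated field on the line.
    description, numbers = line.rstrip().rsplit(None, 1)
    pattern_no, doc_no = numbers.split(",")
    return description, "PS" + pattern_no + ".PAT", "PDOC" + doc_no + ".DOC"

line = "Glycosaminoglycan attachment site.          00002,00002"
print(parse_index_line(line))
```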

        Finally we create a  file  of  file  names  for  all  the
patterns in the library.

        To use the complete PROSITE library from program pip,
select "pattern searcher", choose the option "use file of
pattern file names", and give the file name prosite.nam. For any
matches found, the accession number and pattern title will be
displayed.

        Running the conversion programs

        Only SPLITP3 is necessary  for  using  the  library.  The
other programs only make the original files marginally easier to
browse through and produce an index.

        SPLITP1 splits the prosite.dat file to create a  separate
file   for   each   entry.   Each  file  is  automatically  named
PSentry_number.dat. In addition  it  creates  an  index  for  the
library (see above).
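
        The splitting step can be sketched as below. This is an
illustration, not the SPLITP1 source: it assumes entries end with
a // line, as in the example entry, and that the file name is
taken from the AC line. The function name split_entries is
hypothetical, and it returns the pieces rather than writing files.

```python
def split_entries(dat_text):
    """Map 'accession.dat' to the text of each entry in a .dat file."""
    entries = {}
    current, accession = [], None
    for line in dat_text.splitlines():
        current.append(line)
        if line.startswith("AC "):
            # 'AC   PS00197;' gives 'PS00197'
            accession = line[2:].strip().rstrip(";")
        elif line.strip() == "//":
            # End of entry; blocks without an AC line are skipped.
            if accession is not None:
                entries[accession + ".dat"] = "\n".join(current) + "\n"
            current, accession = [], None
    return entries

sample = (
    "ID   2FE2S_FERREDOXIN; PATTERN.\n"
    "AC   PS00197;\n"
    "PA   C-x(1,2)-[STA]-x(2)-C-[STA]-{P}-C.\n"
    "//\n"
)
print(sorted(split_entries(sample)))
```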

        SPLITP2 performs the same operation for the prosite.doc
file,   except   that  no  index  is  created.  Files  are  named
PSentry_number.doc.

        SPLITP3 creates a separate pattern file and weight matrix
files for each PROSITE entry from the file prosite.dat. Pattern
files are named PSentry_number.pat, weight matrix files
PSentry_number.wtsa, PSentry_number.wtsb, etc. The pattern title
is the one line description of the motif. SPLITP3 also creates  a
file  of  file  names. Notice that it will ask for a path name so
that the path can be included in the file of file names. This  is
the path to the directory in which the pattern files are stored.

        Notes

        Obviously the use of files of file  names  is  a  general
solution,   and  anybody  could  now  create  their  own  set  of
interesting patterns for screening, or a subset  of  prosite.nam,
etc.

        Note that 5 of the Bairoch motifs contain the symbols <
or >, which mean that the motif must appear exactly at the N or
C terminus of the sequence, respectively. Currently our methods
have no mechanism for such definitions, so these motifs (KDEL, for
example) will be permitted to occur anywhere in a sequence.

        Also, of course, the library does not  have  to  be  used
solely  for performing mass screenings: each individual entry can
be used as a single pattern by giving the name of its .pat file,
e.g. pathname/ps00002.pat. In addition, more sophisticated users will
wish to copy pattern files and weight  matrices  into  their  own
directories  and  modify  them. For example the cutoff scores are
probably chosen to be quite high in order to reduce the number of
false positives, and some users might wish to lower them.