@-1. TX 0 @General
@-2. T 0 @Screen control
@-2. X 0 @Screen
@-3. T 0 @Statistical analysis of content
@-3. X 0 @Statistics
@-4. T 0 @Structures and repeats
@-4. X 0 @Structures
@-5. TX 0 @Search
@0. TX -1 @PIP
This is a program for analysing individual protein sequences. It can read sequences stored in many of the most commonly used formats, and performs all of the usual simple analyses. In addition it has very flexible search procedures and presents many of its results graphically. The following analyses (preceded by their option numbers) are included:

    ? = Help
    ! = Quit
    3 = read a new sequence
    4 = define active region
    5 = list the sequence
    6 = list a text file
    7 = direct output to disk
    8 = write active sequence to disk
    9 = edit the sequence
   10 = clear graphics screen
   11 = clear text screen
   12 = draw a ruler
   13 = use cross hair
   14 = reposition plots
   15 = label diagram
   16 = display a map
   17 = search for short sequences
   18 = compare a sequence
   19 = compare a sequence using a score matrix
   20 = search for a sequence using a weight matrix
   21 = calculate amino acid composition
   22 = plot hydrophobicity
   23 = plot charge
   24 = plot Robson prediction
   25 = plot hydrophobic moment
   26 = draw helix wheel
   27 = back translate
   28 = search for patterns of motifs

Some of these methods produce graphical results and so the program is generally used from a graphics terminal (a vdu on which lines and points can be drawn as well as characters). For users of VT640s or their equivalents the terminal must be set nowrap (type NOWRAP) prior to running the program.

The position of each plot is defined relative to a user's drawing board, which spans 1-10,000 in x and 1-10,000 in y. Plots for each option are drawn in a window defined by x0,y0 and xlength,ylength, where x0,y0 is the position of the bottom left hand corner of the window, xlength is the width of the window and ylength is its height.

    --------------------------------------------------------- 10,000
    1                                                       1
    1       --------------------------------------    ^     1
    1       1                                    1    1     1
    1       1                                    1    1     1
    1       1                                    1 ylength  1
    1       1                                    1    1     1
    1       1                                    1    1     1
    1       --------------------------------------    v     1
    1  x0,y0^                                               1
    1       <---------------xlength-------------->          1
    --------------------------------------------------------- 1
    1                                                  10,000

All values are in drawing board units (i.e. 1-10,000, 1-10,000). The default window positions are read from a file "ANALPMRG" when the program is started. Users can have their own file if required.

The program can handle sequences stored in several formats: Staden, EMBL, GENBANK, PIR (also known as NBRF) and GCG. They are described in the help for 'READ NEW SEQUENCE'.

The options for the program are accessed from 5 main menus: general, screen control, statistical analysis of content, structure, search. Both menus and options are selected by number.
@1. TX 0 @Help
This option gives online help. The user should select option numbers and the current documentation will be given. Note that option 0 gives an introduction to the program, and that ? will get help from anywhere in the program.
@2. TX 0 @Quit
This function stops the program.
@3. TX 1 @Read a new sequence
This option allows users to read in new sequences, browse through annotations, or search sequence libraries for keywords.

Sequences can be read from "personal" sequence files or from sequence libraries. These are referred to as the sequence "source". Personal files can be stored in several formats: Staden, PIR, EMBL, GENBANK and GCG. At LMB we use "Staden" format for sequencing and all the libraries are stored in their original formats. Note, however, that libraries such as EMBL or GenBank that are divided into several files (e.g. GenBank has 13 separate files) are indexed as a whole. This means that users do not need to know which file contains an entry, only which library.

When the user selects to read in a sequence the program first asks for the sequence "source". If the user selects "personal" the program will ask for the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of the file. For PIR format the user must also supply the entry name of the sequence, as the file can contain several entries. For the other formats only a single entry is expected. The file will be read, its length and composition will be displayed and the option left.

If the user selects "library" as the sequence source the program will display a list of available libraries. The programs are capable of handling all current libraries but which ones are available will vary from site to site. At LMB we have several libraries and also weekly updates of data gathered between releases. The program will ask users to select a library and then give a list of options:

 X 1 Get a sequence
   2 Get annotations
   3 Get entrynames from accession numbers
   4 Search titles for keywords
   5 Search text index for keywords

If "Get a sequence" or "Get annotations" is selected users will be asked to type the entry name. The option will be left when a sequence is selected or ! is typed. The composition and length will be displayed.

The text index contains all words from feature tables, reference titles, definition lines, keyword lists and comments, so the text index search is the most useful. It is also the fastest. Up to 5 words can be searched for at once. The words should be typed separated by spaces, for example

 ? Keywords=P53 mouse murine tumo

will search for all entries that contain words starting with p53, mouse, murine and tumo. Only the unique entries that contain ALL words will be listed. Before listing the matching entries the program will show the number of 'hits' for each word and ring the bell. Escape is possible at this point, or after each screenful of entries.
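
The word matching described above (each word matches as a prefix, every word must be present, case is ignored) can be sketched as follows. This is only an illustration: the function name, the in-memory dictionary standing in for the library's text index, and the toy index contents are all assumptions, and counting 'hits' as the number of entries matched by each word is a guess at what the program reports.

```python
def search_text_index(index, query, max_words=5):
    """Prefix-match each query word against an index and AND the results.

    index: dict mapping an indexed word to the entry names containing it
           (a hypothetical in-memory stand-in for the library's text index).
    Returns (hits, entries): hits maps each query word to the number of
    entries it matched; entries lists the names matched by ALL words.
    """
    words = query.upper().split()
    if not words:
        return {}, []
    if len(words) > max_words:
        raise ValueError(f"at most {max_words} words can be searched at once")

    hits = {}
    matched_sets = []
    for word in words:
        entries = set()
        for indexed_word, names in index.items():
            # A query word matches every indexed word that starts with it.
            if indexed_word.upper().startswith(word):
                entries.update(names)
        hits[word] = len(entries)
        matched_sets.append(entries)

    # Only entries containing ALL of the words are reported.
    return hits, sorted(set.intersection(*matched_sets))


# Toy index, invented for illustration only.
toy_index = {
    "P53": ["MMP53", "MMANTP53"],
    "MOUSE": ["MMP53", "MMANTP53", "MMLYN"],
    "MURINE": ["MMANT01"],
    "TUMOUR": ["MMANT01"],
    "TUMOR": ["MMANTP53"],
}
hits, entries = search_text_index(toy_index, "p53 mouse tumo")
print(hits)
print(entries)
```

Because "tumo" is matched as a prefix, it picks up both the "tumour" and "tumor" spellings, which is presumably why the example keyword line truncates the word.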
In addition to the entry names the text search displays the primary accession number, the sequence length and up to 80 characters of description. (The search of 'titles' is now redundant because the full text index contains all the title words and the search is much faster. It will probably be removed from the program.) All searches are independent of case. Where possible the program will offer default entry names. Typical dialogue follows. Select sequence source X 1 Personal file 2 Sequence library ? Selection (1-2) (1) = Select sequence file format X 1 Staden 2 EMBL 3 GenBank 4 PIR 5 GCG ? Selection (1-5) (1) = ? Sequence file name=M13MP7.SEQ Contig title removed Sequence length= 7238 Sequence composition T C A G - 2405. 1539. 1765. 1527. 2. 33.2% 21.3% 24.4% 21.1% 0.0% . . . Select sequence source X 1 Personal file 2 Sequence library ? Selection (1-2) (1) =2 Select a library X 1 EMBL 29 nucleotide library Dec 91 2 SWISSPROT 20 protein library Nov 91 3 PIR 31 protein library Dec 91 4 NRL3D 58 From Brookhaven protein library Dec 91 5 GenBank ? Selection (1-5) (1) = Library is in EMBL format with indexes Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =5 Search for keywords ? 
Keywords=P53 mouse P53 hits 68 MOUSE hits 8180 MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon MMANT11 X00885 562 Murine p53 gene 3' region with exon 11 MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end. MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds. MMP53 X00741 1377 Mouse mRNA for transformation associated protein MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53. MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11. MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8. MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4 MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53 MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53 MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5 22 different entries found Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =4 Search for keywords ? 
Keywords=alpha Searching for alpha AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank. ADHBADA1 102 duck alpha-d-globin gene, exon 1. ADHBADA2 1145 duck alpha-a-globin gene and 5' flank ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank. AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide. AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b AGAMYLS 120 synthetic alpha-amylase gene, 5' end. AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han ! Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =3 ? Accession number=v00636 Entry name LAMBDA Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =2 Default Entry name=LAMBDA ? Entry name= ID LAMBDA standard; DNA; PHG; 48502 BP. XX AC V00636; J02459; M17233; X00906; XX DT 03-JUL-1991 (Rel. 28, Last updated, Version 3) DT 09-JUN-1982 (Rel. 1, Created) XX DE Genome of the bacteriophage lambda (Styloviridae). 
 XX
 KW   circular; coat protein; DNA binding protein; genome;
 KW   origin of replication.
 XX
 OS   Bacteriophage lambda
 OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
 XX
 RN   [1]
 RP   1-48502
 RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
 RT   "Nucleotide sequence of bacteriophage lambda DNA";
 RL   J. Mol. Biol. 162:729-773(1982).
 XX
 !
 Select a task
 X 1 Get a sequence
   2 Get annotations
   3 Get entry names from accession numbers
   4 Search titles for keywords
   5 Search text index for keywords
 ? Selection (1-5) (1) =
 Default Entry name=LAMBDA
 ? Entry name=
 DE   Genome of the bacteriophage lambda (Styloviridae).
 Sequence length 48502
 Sequence composition
         T         C         A         G         -
    11988.    11360.    12336.    12818.        0.
     24.7%     23.4%     25.4%     26.4%      0.0%
@4. TX 1 @Redefine active region
For its analytic functions the program always works on a region of the sequence called the active region. When a new sequence is read into the program the active region is automatically set to start at the beginning of the sequence and to extend up to the maximum active region size that the version of the program can handle. On most machines this will be to the end of the sequence. The positions are shown on the screen. This option allows the user to define a different region. Note that for convenience in the listing and translation functions the user is given access to regions outside the active region.
@5. TX 1 @List a sequence
The sequence can be listed with line lengths from 10 to 120 in multiples of 10. Output can be directed to a disk file by first selecting disk output.
The output looks like:

         10         20         30         40         50         60
 MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
         70         80         90        100        110        120
 AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
        130        140        150        160        170        180
 LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
        190        200        210        220        230        240
 DAIINQRDSG IKCIYVAIGQ KASTISNVVR KLEEHGALAN TIVVVATASE SAALQYLARM
        250        260        270        280        290        300
 PVALMGEYFR DRGEDALIIY DDLSKQAVAY RQISLLLRRP PGREAFPGDV FYLHSRLLER
        310        320        330        340        350        360
 AARVNAEYVE AFTKGEVKGK TGSLTALPII ETQAGDVSAF VPTNVISITD GQIFLETNLF
        370        380        390        400        410        420
 NAGIRPAVNP GISVSRVGGA AQTKIMKKLS GGIRTALAQY RELAAFSQFA SDLDDATRKQ
        430        440        450        460        470        480
 LDHGQKVTEL LKQKQYAPMS VAQQSLVLFA AERGYLADVE LSKIGSFEAA LLAYVDRDHA
        490        500        510        520        530        540
 PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
@6. TX 1 @List a text file
Allows the user to have a text file displayed on the screen. It will appear one page at a time.
@7. TX 1 @Direct output to disk
Used to direct output that would normally appear on the screen to a file. Select redirection of either text or graphics, and supply the name of the file that the output should be written to.

The results from the next options selected will not appear on the screen but will be written to the file. When option 7 is selected again the file will be closed and output will again appear on the screen.
@8. TX 1 @Write active region to disk
The program has the capability of reading in EMBL, GENBANK, NBRF, GCG and Staden formats and of reversing and complementing sequences. This option allows users to write the current active sequence to a disk file in Staden format. Hence it allows format conversion and crude sequence cutting.
@9. TX 1 @Edit the sequence
Used to edit sequences or any other files by giving access to the computer's system editor. For editing sequences the input file should have already been created using the listing function "list sequence".

Supply the name of the file to edit.
Wait while the system editor is made ready (this can take a while on a VAX).
Use the editor.
Exit from the editor.
If a sequence has been edited, and you want to process it, affirm that the sequence should be "made active". The edited sequence will replace the original sequence.

This editing method is designed to give users access to an editor with which they are familiar (i.e. the one on their machine) and yet to allow them to edit a sequence which contains the landmarks they need in order to know where they are. Users can create files containing simple listings with numbering, using "list the sequence", and then edit them with their system editor, using the numbering to know where they are within the sequence. When the edits are complete they exit from the editor and the program "analyses" the edited file to extract only the sequence characters.

The permitted set of characters is defined to be: ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz. All permitted characters found in the file will become part of the sequence; all others are removed.
@10. TX 2 @Clear graphics
Clears the screen of both text and graphics.
@11. TX 2 @Clear text
Clears only text from the screen.
@12. TX 2 @Draw a ruler
This option allows the user to draw a ruler or scale along the x axis of the screen to help identify the coordinates of points of interest. The user can define the position of the first amino acid to be marked. For example, if the active region is 1501 to 8000, the user might wish to mark every 1000th amino acid starting at either 1501 or 2000, depending on whether the active region is to be treated as an independent unit with its own numbering starting at its left edge, or as part of the whole sequence. The user can also define the separation of the ticks on the scale and their height. If required the labelling routine can be used to add numbers to the ticks.
@13. TX 2 @Use cross hair
This function puts a steerable cross on the screen that can be used to find the coordinates of points in the sequence. The user can move the cross around using the directional keys; when the user hits the space bar the program will print out the coordinates of the cross in sequence units and the option will be exited. If instead, you hit a , the position will be displayed but the cross will remain on the screen. If a letter s is hit the sequence around the cross hair is displayed and the cross remains on the screen.
@14. TX 2 @Reset margins
The position of each plot is defined relative to a user's drawing board, which spans 1-10,000 in x and 1-10,000 in y. Plots for each option are drawn in a window defined by x0,y0 and xlength,ylength, where x0,y0 is the position of the bottom left hand corner of the window, xlength is the width of the window and ylength is its height.

    --------------------------------------------------------- 10,000
    1                                                       1
    1       --------------------------------------    ^     1
    1       1                                    1    1     1
    1       1                                    1    1     1
    1       1                                    1 ylength  1
    1       1                                    1    1     1
    1       1                                    1    1     1
    1       --------------------------------------    v     1
    1  x0,y0^                                               1
    1       <---------------xlength-------------->          1
    --------------------------------------------------------- 1
    1                                                  10,000

All values are in drawing board units (i.e. 1-10,000, 1-10,000). The default window positions are read from a file "ANALMARG" when the program is started. Users can have their own file if required.

As all the plots start at the same position in x and have the same width, x0 and xlength are the same for all options. Generally users will only want to change the start level of the window y0 and its height ylength.

This option allows users to change window positions whilst running the program. The routine prompts first for the number of the option that the user wishes to reposition; then for the y start and height; then for the x start and length. Note that changes to the x values affect all options.
If the user types only carriage return for any value it will remain unchanged. The cross-hair can be used to choose suitable heights.
@15. TX 2 @Label a diagram
This routine allows users to label any diagrams they have produced. They are asked to type in a label. When the user types carriage return to finish typing the label the cross-hair appears on the screen. The user can position it anywhere on the screen. If the user types R (for right justify) the label will be written on the diagram with its right end at the cross-hair position. If the user types L (for left justify) the label will be written on the diagram with its left end at the cross-hair position. The cross-hair will then immediately reappear. The user may put the same label on another part of the diagram in the same way, or, on hitting the space bar, will be asked whether another label should be typed in.
@16. TX 2 @Display a map
It is often convenient to plot a map alongside graphed analysis in order to indicate features within the sequence. This function allows users to draw maps using files arranged in the form of EMBL feature tables. Of course EMBL tables are usually only used for nucleic acid sequence annotation but, as long as the features are written in the correct format, they can be employed by this routine.

The map is composed of a line representing the sequence and then further lines denoting the endpoints of each feature the user identifies.

The user is asked to define the height at which the line representing the sequence should be drawn; then for the feature height; then for the features to plot.
@17. TX 1 5 @Short sequence search
This routine is used to search for exact matches to short sequences. It is equivalent to the restriction enzyme search in program NIP. It can either list matches or present the results graphically.

Select from searching, screen clearing or file listing. Choose a file of strings and the mode of output required.
The files of short sequences (strings) and their names need to be arranged in a particular way. For example ACID/D/E// BASIC/R/K/H// HYDRO/F/L/I/V/Y// GLYCO/N-S/N-T// +/R/K/H// -/D/E// defines various groups of amino acids. Each string or set of strings must be preceded by a name, each string must be preceded and terminated with a slash (/), and each set of strings by 2 slashes. These collections of strings and their names can be read from disk or entered from the keyboard. Two files containing sequences are currently available. One contains named groups of amino acids. The other simply contains the names of all amino acids and gives a convenient way of producing a plot of the positions of all the different amino acids in the sequence. The user can select strings by name from these collections. Results can be displayed name by name or all together. Strings entered from the keyboard need to be separated by slash characters(/). For the name by name search the output looks like: MATCHES= 12 NAME SEQUENCE POSITION FRAGMENT LENGTHS ACID E 7 7 1 ACID E 10 3 1 ACID E 24 14 1 ACID E 28 4 1 ACID D 36 8 1 ACID D 46 10 2 ACID E 51 5 2 ACID E 67 16 2 ACID D 69 2 2 ACID D 81 12 2 ACID E 84 3 2 ACID E 96 12 3 MATCHES= 10 NAME SEQUENCE POSITION FRAGMENT LENGTHS BASIC K 13 13 1 BASIC R 15 2 1 BASIC H 26 11 1 BASIC R 40 14 1 BASIC H 42 2 2 BASIC R 59 17 2 BASIC R 68 9 2 BASIC K 87 19 2 BASIC K 89 2 2 BASIC R 93 4 2 MATCHES= 1 NAME SEQUENCE POSITION FRAGMENT LENGTHS GLYCO NST 4 4 3 or when the results are ordered only on position the output looks like: NAME SEQUENCE POSITION FRAGMENT LENGTHS GLYCO NST 4 3 ACID E 7 3 ACID E 10 3 BASIC K 13 3 BASIC R 15 2 ACID E 24 9 BASIC H 26 2 ACID E 28 2 ACID D 36 8 BASIC R 40 4 BASIC H 42 2 ACID D 46 4 ACID E 51 5 BASIC R 59 8 Graphical output marks the position of each string by a short vertical line and gives its name at the left end of the line. 
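
The string file layout and the exact-match search described above can be sketched as follows. This is a minimal illustration, not the program's own code: the function names are hypothetical, and the sketch assumes that any non-letter in a string (such as the "-" in N-S) matches any residue, which is what the GLYCO example (N-T matching NST) suggests.

```python
def parse_string_file(text):
    """Parse 'NAME/string/string//' records into {name: [strings]}."""
    groups = {}
    for record in text.split("//"):      # two slashes end each set of strings
        record = record.strip()
        if not record:
            continue
        name, *strings = record.split("/")  # the name precedes the strings
        groups[name.upper()] = [s for s in strings if s]
    return groups


def find_matches(sequence, groups, names):
    """Return (name, matched residues, 1-based position), ordered by position."""
    sequence = sequence.upper()
    matches = []
    for name in names:
        for pattern in groups[name.upper()]:
            pattern = pattern.upper()
            for start in range(len(sequence) - len(pattern) + 1):
                window = sequence[start:start + len(pattern)]
                # Assumption: a non-letter in the pattern is a wild card.
                if all(not p.isalpha() or p == c
                       for p, c in zip(pattern, window)):
                    matches.append((name.upper(), window, start + 1))
    return sorted(matches, key=lambda m: m[2])


strings_file = ("ACID/D/E// BASIC/R/K/H// HYDRO/F/L/I/V/Y// "
                "GLYCO/N-S/N-T// +/R/K/H// -/D/E//")
groups = parse_string_file(strings_file)
# First ten residues of the sequence shown in the listing example.
for match in find_matches("MQLNSTEISE", groups, ["glyco", "acid"]):
    print(match)
```

The FRAGMENT and LENGTHS columns of the listings are not computed here, since their exact definition is not spelled out in this section.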
If the top of the screen is reached the program gives the user the opportunity to take a hard copy and then will clear the screen and restart plotting results at the original start position.

Note that any character in the string that is not a recognisable protein symbol will be treated as a wild card character and will match with all characters in the searched sequence.

Typical dialogue follows.

 Menus and their numbers are
 m0 = This menu
 m1 = General
 m2 = Screen control
 m3 = Statistical analysis of content
 m4 = Structure
 m5 = Search
 ? = Help
 ! = Quit
 ? Menu or option number=17
 Search for short sequences
 X 1 Search
   2 List enzyme file
   3 Clear text
   4 Clear graphics
 ? 0,1,2,3,4 =2
   1 All acids
 X 2 Named groups
   3 Personal file
   4 Keyboard
 ? 0,1,2,3,4 =
 ACID/D/E//
 BASIC/R/K/H//
 HYDRO/F/L/I/V/Y//
 GLYCO/N-S/N-T//
 +/R/K/H//
 -/D/E//
 DIBASIC/RR/KK/RK/KR//
 TURN/N/D/G/P/S//
 BLOCK/A/Q/E/I/L/M/F/W/V//
 INDIF/R/C/H/K/T/Y//
 End of file
 X 1 Search
   2 List enzyme file
   3 Clear text
   4 Clear graphics
 ? 0,1,2,3,4 =
   1 All acids
 X 2 Named groups
   3 Personal file
   4 Keyboard
 ? 0,1,2,3,4 =
 ? (y/n) (y) All names n
 ? Name=acid
 ? Name=basic
 ? Name=glyco
 ? Name=
 ? (y/n) (y) Show results name by name
 ? (y/n) (y) List matches
 searching
 matches= 59
 NAME   SEQUENCE  POSITION  FRAGMENT  LENGTHS
 ACID   E                7         7        1
 ACID   E               10         3        1
 ACID   E               24        14        1
 ACID   E               28         4        1
 ACID   D               36         8        1
 ACID   D               46        10        2
 ACID   E               51         5        2
 ACID   E               67        16        2
 ACID   D               69         2        2
 ACID   D               81        12        2
 ACID   E               84         3        2
 ACID   E               96        12        3
 ACID   D              116        20        3
 matches= 61
 NAME   SEQUENCE  POSITION  FRAGMENT  LENGTHS
 BASIC  K               13        13        1
 BASIC  R               15         2        1
 BASIC  H               26        11        1
 BASIC  R               40        14        1
 BASIC  H               42         2        2
 BASIC  R               59        17        2
 ...etc
 matches= 2
 NAME   SEQUENCE  POSITION  FRAGMENT  LENGTHS
 GLYCO  NST              4         4        3
 GLYCO  NQT            487       483       28
                                  28      483
 X 1 Search
   2 List enzyme file
   3 Clear text
   4 Clear graphics
 ? 0,1,2,3,4 =
   1 All acids
 X 2 Named groups
   3 Personal file
   4 Keyboard
 ? 0,1,2,3,4 =
 ? (y/n) (y) Selected names
 ? Name=basic
 ? Name=glyco
 ? Name=
 ? (y/n) (y) Show results name by name n
 ?
(y/n) (y) List matches searching NAME SEQUENCE POSITION FRAGMENT LENGTHS GLYCO NST 4 3 BASIC K 13 9 BASIC R 15 2 BASIC H 26 11 BASIC R 40 14 BASIC H 42 2 BASIC R 59 17 BASIC R 68 9 BASIC K 87 19 ...etc BASIC R 477 14 BASIC H 479 2 GLYCO NQT 487 8 BASIC K 499 12 BASIC K 501 2 BASIC K 508 7 7 X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All acids X 2 Named groups 3 Personal file 4 Keyboard ? 0,1,2,3,4 =4 Define search strings by typing a string name followed by the string(s) ? Name=MARY ? String(s)=AL/VI ? Name= ? (y/n) (y) All names ? (y/n) (y) Show results name by name ? (y/n) (y) List matches searching matches= 12 NAME SEQUENCE POSITION FRAGMENT LENGTHS MARY VI 38 38 10 MARY AL 63 25 13 MARY VI 136 73 16 MARY AL 177 41 19 MARY AL 217 40 25 MARY AL 233 16 37 MARY AL 243 10 40 MARY AL 256 13 41 MARY AL 326 70 45 MARY VI 345 19 51 MARY AL 396 51 70 MARY AL 470 74 73 @18. TX 1 5 @Compare a sequence This routine slides a short sequence along the current sequence and finds all positions at which a given percentage of the amino acids match. Output is in both graphical and listed forms. If users call for dialogue when the routine is selected they will be given the choice of keyboard or file input. Define the string, and the percentage match. Matches will be plotted out and then the user can select to have them listed. Then the routine cycles around. The routine slides the search string along the sequence and marks the positions at which a minimum percentage score is reached. The graphical output draws a vertical line at the match position; the height of the line represents the percentage score, so that if the line reaches the top of the box the score is 100%. Typical dialogue follows. ? Menu or option number=18 Find percentage matches ? (y/n) (y) Keep picture ? String=aaa ? 
Percent match (1.00-100.00) (70.00) =
 missing graphics
 Total scoring positions above 70.000 percent = 19
 Scores        2   2   2   2   2   2   2   2   2   2
 Positions    61 131 177 217 226 231 232 267 300 301
 ? Number to list (0-19) (0) =3
   61
  AIA
  * *
  aaa
    1
  131
  AIA
  * *
  aaa
    1
  177
  ALA
  * *
  aaa
    1
 ? (y/n) (y) Keep picture n
 Default String=aaa
 ? String=!
@19. TX 1 5 @Compare a sequence using a score matrix
This routine slides a short sequence along the current sequence and finds all positions at which a given level of similarity (a cutoff score) is reached. The score is defined by use of a score matrix (MDM78). Output is in both graphical and listed forms.

If users call for dialogue when the routine is selected they will be given the choice of keyboard or file input. Define the string and the cutoff score. Matches will be plotted out and then the user can select to have them listed. Then the routine cycles around.

The routine slides the search string along the sequence and marks the positions at which the cutoff score is achieved. The graphical output draws a vertical line at the match position; the height of the line represents the score, so that if the line reaches the top of the box the score is the maximum possible.

Typical dialogue follows.

 Menus and their numbers are
 m0 = This menu
 m1 = General
 m2 = Screen control
 m3 = Statistical analysis of content
 m4 = Structure
 m5 = Search
 ? = Help
 ! = Quit
 ? Menu or option number=19
 Find matches using a score matrix
 ? (y/n) (y) Keep picture
 ? String=aaa
 Minimum score= 12 Maximum score= 36
 ? Score (12-36) (36) =
 missing graphics
 For score 24 the number of matches= 507
 scores       35  35  35  34  34  34  34  34  34  34
 positions   226 231 379 112 133 202 227 267 378 380
 ? Number to list (0-507) (0) =3
  226
  ATA
  * *
  aaa
    1
  231
  SAA
   **
  aaa
    1
  379
  GAA
   **
  aaa
    1
 ? (y/n) (y) Keep picture n
 Default String=aaa
 ? String=!
@20. TX 5 @Search for a motif using a weight matrix
This function performs searches for short sequence motifs using an appropriate weight matrix.
In addition it can be used to create or modify weight matrices.

In order to perform a search the only input required is the name of the file containing the weight matrix. The results can be presented graphically or listed. The graphical presentation will draw a line at the position of any matches found; the height of the line is proportional to the score.

For a search, select "use weight matrix", supply the name of the file containing the weight matrix, and choose between having results plotted or listed. If dialogue is requested when the function is selected users can alter the cutoff score employed.

To create a weight matrix several steps are involved. A file containing an alignment of known motifs is required. (This file must be created before the current option is selected. The format is as follows: each sequence is written on a separate line with at least one space at the beginning; each sequence is terminated by a space character, and can be followed by a name. The sequences must be aligned.)

Supply the name of the file of aligned sequences. The program reads and displays the sequences.
Choose between "summing logs of weights" or summing weights (i.e. whether to multiply or add weights). If logs are used all scores will be negative.
Choose whether all positions in the set of aligned sequences should be used or whether a mask should be applied. If a mask is selected, define it as a string of symbols in which the symbol - means ignore and any other symbol means use. E.g. xx-x--abc means use all positions except 3, 5 and 6.

The program will calculate weights as the frequencies of each amino acid at each unmasked position in the set of aligned sequences. These weights are then applied to the set of aligned sequences to give a range of "observed" scores. The mean and standard deviation of these scores are displayed.
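
The calculation just described (frequencies at each unmasked position, applied back to the aligned sequences to give observed scores) might be sketched as below. Several details are assumptions made for illustration: the function names, the 20-letter alphabet, the add-one smoothing with its normalisation (used here so that an unseen amino acid never gives log(0)), and the choice of the population standard deviation. The program's own arithmetic may differ, so the numbers printed will not necessarily reproduce its output.

```python
import math
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet


def build_weights(aligned, mask=None):
    """Return {position: {amino acid: weight}} for the unmasked positions."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    if mask is not None and len(mask) != length:
        raise ValueError("mask must be as long as the motif")
    total = len(aligned) + len(AMINO_ACIDS)  # add-one smoothing (assumption)
    weights = {}
    for pos in range(length):
        if mask is not None and mask[pos] == "-":
            continue  # '-' means ignore this position
        column = [seq[pos].upper() for seq in aligned]
        weights[pos] = {aa: (column.count(aa) + 1) / total
                        for aa in AMINO_ACIDS}
    return weights


def score(window, weights, use_logs=True):
    """Sum the weights (or their logs) over the unmasked positions."""
    if not weights:
        return 0.0
    floor = min(min(col.values()) for col in weights.values())
    total = 0.0
    for pos, col in weights.items():
        w = col.get(window[pos].upper(), floor)  # unknown symbols score lowest
        total += math.log(w) if use_logs else w
    return total


# Three of the helix-turn-helix sequences from the example alignment.
aligned = ["QESVADKMGMGQSGVGALFN",
           "QTKTAKDLGVYQSAINKAIH",
           "QAALGKMVGVSNVAISQWQR"]
weights = build_weights(aligned, mask="--xxxxxxxxxxxx------")
observed = [score(seq, weights) for seq in aligned]
print(mean(observed), pstdev(observed))
```

Summing logs is equivalent to multiplying the weights, and because every weight is below 1 each log is negative, which is why all scores are negative in that mode.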

The user is asked to supply several values to be used when the weight matrix is applied to other sequences: a cutoff score (by default, the mean minus 3 standard deviations), a top score for scaling graphical results (by default, the mean plus 3 standard deviations), and a position to identify (if a particular amino acid within the motif is used as a "landmark", such as the G of the helix-turn-helix motif, then its position will be marked in plots). All these values are stored along with the weight matrix. Finally supply the name of a file to contain the weight matrix.

Weight matrices can be "rescaled" using a set of aligned sequences in much the same way as a matrix is created. The purpose is to redefine the cutoff scores; rescaling does not alter any other values in the weight matrix file.

The methods have changed considerably but were first outlined in Staden, R., Nucleic Acids Res. 12, 505-519 (1984), and Staden, R., in Genetic Engineering: Principles and Methods, vol. 7, edited by J.K. Setlow and A. Hollaender, Plenum Publishing Corp. (1985). The methods have always had to deal with the problem of zeroes in the matrices. The current versions employ "Laplace's Law of Succession", in which 1 is added to each term.

It is now possible to apply a mask to a set of aligned sequences in order to give weight to selected positions only. Sequences have superimposed functions: some parts may be of general structural importance and give rise to an overall framework, and other parts give specificity and hence are not common; we may want to use a set of aligned sequences to define a motif, but want to use only the framework positions. Alternatively we may want to pick out only those parts of a set of aligned sequences that give a particular property, and to ignore other similarities that are due to some other property and which could obscure the pattern we are interested in.
The ability to define a mask allows certain positions to be used in the motif and others to be ignored, and yet still permits the use of a set of aligned sequences to calculate weights. Typical dialogue is shown below. ? Menu or option number=20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 =2 ? Name of aligned sequences file=[rs.motifs]hth.seq 1 QESVADKMGMGQSGVGALFN LAMBDA.REP 2 QTKTAKDLGVYQSAINKAIH LAMBDA.CRO 3 QAALGKMVGVSNVAISQWQR P22.REP 4 QRAVAKALGISDAAVSQWKE P22.CRO 5 QAELAQKVGTTQQSIEQLEN 434.REP 6 QTELATKAGVKQQSIQLIEA 434.CRO 7 RQEIGQIVGCSRETVGRILK CAP 8 RGDIGNYLGLTVETISRLLG Fnr 9 LYDVAEYAGVSYQTVSRVVN LAC.R 10 IKDVARLAGVSVATVSRVIN GAL.R 11 TEKTAEAVGVDKSQISRWKR LAMBDA.CII 12 QRKVADALGINESQISRWKG P22.CI 13 KEEVAKKCGITPLQVRVWCN MAT.ALPHA 14 TRKLAQKLGVEQPTLYWHVK TETR.TN10 15 TRRLAERLGVQQPALYWHFK TETR.pSC1 16 QRELKNELGAGIATITRGSN TRP.REP 17 RQQLAIIFGIGVSTLYRYFP H-INVERSN 18 ATEIAHQLSIARSTVYKILE TN3.RESOL 19 ASHISKTMNIARSTVYKVIN GD.RESOLV 20 IASVAQHVCLSPSRLSHLFR ARA.C 21 RAEIAQRLGFRSPNAAEEHL LEX.R Length of motif 20 ? (y/n) (y) Sum logs of weights ? (y/n) (y) Use all motif positions n x means use, - means ignore e.g. xx-x---x-x means use positions 1,2,4,8,10 ? 
Mask=--xxxxxxxxxxxx------ Applying weights to input sequences 1 -57.143 QESVADKMGMGQSGVGALFN 2 -55.087 QTKTAKDLGVYQSAINKAIH 3 -58.079 QAALGKMVGVSNVAISQWQR 4 -54.986 QRAVAKALGISDAAVSQWKE 5 -55.181 QAELAQKVGTTQQSIEQLEN 6 -55.874 QTELATKAGVKQQSIQLIEA 7 -56.692 RQEIGQIVGCSRETVGRILK 8 -57.722 RGDIGNYLGLTVETISRLLG 9 -55.363 LYDVAEYAGVSYQTVSRVVN 10 -55.769 IKDVARLAGVSVATVSRVIN 11 -56.786 TEKTAEAVGVDKSQISRWKR 12 -55.833 QRKVADALGINESQISRWKG 13 -56.279 KEEVAKKCGITPLQVRVWCN 14 -53.125 TRKLAQKLGVEQPTLYWHVK 15 -55.833 TRRLAERLGVQQPALYWHFK 16 -58.651 QRELKNELGAGIATITRGSN 17 -56.749 RQQLAIIFGIGVSTLYRYFP 18 -56.986 ATEIAHQLSIARSTVYKILE 19 -60.618 ASHISKTMNIARSTVYKVIN 20 -58.988 IASVAQHVCLSPSRLSHLFR 21 -58.002 RAEIAQRLGFRSPNAAEEHL Top score -53.125 Bottom score -60.618 Mean -56.655 Standard deviation 1.617 Mean minus 3.sd -61.505 Mean plus 3.sd -51.804 ? Cutoff score (-999.00-9999.00) (-61.51) = ? Top score for scaling plots (-61.51-999.00) (-51.80) = ? Position to identify (0-20) (1) =9 ? Title=hth ? Name for new weight matrix file=1.wts Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structure m5 = Search ? = Help ! = Quit ? Menu or option number=20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 = ? Motif weight matrix file=1.wts hth ? (y/n) (y) Use frequencies as weights ? 
(y/n) (y) Plot results n 5 -61.46 STEISELIKQRIAQFNVVSE 13 -58.93 KQRIAQFNVVSEAHNEGTIV 21 -60.42 VVSEAHNEGTIVSVSDGVIR 57 -59.39 GNRYAIALNLERDSVGAVVM 59 -61.47 RYAIALNLERDSVGAVVMGP 79 -59.90 YADLAEGMKVKCTGRILEVP 88 -61.41 VKCTGRILEVPVGRGLLGRV 104 -60.38 LGRVVNTLGAPIDGKGPLDH 127 -60.13 SAVEAIAPGVIERQSVDQPV 129 -59.91 VEAIAPGVIERQSVDQPVQT 133 -60.79 APGVIERQSVDQPVQTGYKA 139 -61.12 RQSVDQPVQTGYKAVDSMIP 175 -58.90 KTALAIDAIINQRDSGIKCI 191 -60.95 IKCIYVAIGQKASTISNVVR 195 -60.94 YVAIGQKASTISNVVRKLEE 215 -60.66 HGALANTIVVVATASESAAL 254 -60.56 EDALIIYDDLSKQAVAYRQI 260 -60.08 YDDLSKQAVAYRQISLLLRR 297 -61.00 LLERAARVNAEYVEAFTKGE 314 -61.29 KGEVKGKTGSLTALPIIETQ 330 -60.49 IETQAGDVSAFVPTNVISIT 363 -57.63 GIRPAVNPGISVSRVGGAAQ 365 -61.48 RPAVNPGISVSRVGGAAQTK 371 -61.02 GISVSRVGGAAQTKIMKKLS 382 -57.90 QTKIMKKLSGGIRTALAQYR 394 -60.07 RTALAQYRELAAFSQFASDL 424 -59.95 GQKVTELLKQKQYAPMSVAQ 430 -58.89 LLKQKQYAPMSVAQQSLVLF 432 -61.14 KQKQYAPMSVAQQSLVLFAA 438 -58.58 PMSVAQQSLVLFAAERGYLA 458 -61.06 DVELSKIGSFEAALLAYVDR 466 -61.00 SFEAALLAYVDRDHAPLMQE 483 -60.48 MQEINQTGGYNDEIEGKLKG 494 -60.61 DEIEGKLKGILDSFKATQSW Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structure m5 = Search ? = Help ! = Quit ? Menu or option number=d20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 = ? Motif weight matrix file=1.wts hth ? (y/n) (y) Use frequencies as weights ? Cutoff score (-9999.00-9999.00) (-61.51) =-56. ? (y/n) (y) Plot results n @21. TX 3 @Calculate amino acid composition This function calculates the amino acid composition and molecular weight for the active region. ? Menu or option number=21 Sequence composition A C S T P A G N D E Q B Z H N 3. 32. 23. 18. 57. 47. 16. 28. 31. 28. 0. 0. 7. % 0.6 6.2 4.5 3.5 11.1 9.1 3.1 5.4 6.0 5.4 0.0 0.0 1.4 W 309. 2786. 2325. 1748. 4051. 2682. 1826. 3222. 4003. 3588. 0. 0. 960. A R K M I L V F Y W - X ? N 30. 24. 11. 40. 47. 41. 14. 15. 1. 0. 0. 0. 1. 
% 5.8 4.7 2.1 7.8 9.1 8.0 2.7 2.9 0.2 0.0 0.0 0.0 0.2 W 4686. 3076. 1443. 4527. 5319. 4065. 2060. 2448. 186. 0. 0. 0. 0. Total molecular weight= 55328. @22. TX 3 4 @Plot hydrophobicity This routine plots the hydrophobicity of each section of the sequence using the hydrophobicity values of Kyte and Doolittle (J. Mol. Biol. 157, 105-132 (1982)). A window of size span is slid along the sequence and a sum calculated for each position. If dialogue is requested select a span length and a plot interval. The diagrams are on the same scale as Fig. 6 of the Kyte and Doolittle paper and values of + and - 50 could be assigned to the top and bottom of the diagram with corresponding values in between (-40,-20,0,20,40 are shown in the paper). ? Menu or option number=d22 Plot hydrophobicity ? odd span length (1-101) (11) = ? plot interval (1-101) (3) = missing graphics @23. TX 3 4 @Plot charge This routine plots the charge of each section of the sequence. A window of size span is slid along the sequence and a sum calculated for each position. Amino acids are assigned charges of 1, -1 or 0. If dialogue is requested select a span length and a plot interval. Typical dialogue follows. ? Menu or option number=d23 Plot charge ? odd span length (1-101) (11) = ? plot interval (1-101) (3) = missing graphics @24. TX 4 @Plot robson prediction This routine uses the method of Garnier J, Osguthorpe D J, and Robson B. (1978) J. Mol. Biol. 120, 97-120 to predict secondary structures. The method divides protein secondary structures into 4 classes: helix, extended (usually referred to as sheet), turn and coil. The routine calculates the likelihood that each segment of the sequence lies in each of these classes. Results are presented graphically or listed. If dialogue is requested choose between plotted or listed output. Each residue has a certain probability of being found in each of the 4 classes. 
This probability depends both on its own amino acid type and also the 8 amino acids found to either side along the protein chain. Four tables of weights, each 20 by 17 elements, are used to calculate the likelihood that each residue along the chain falls into one of the four classes of structure. The most likely structure at each point is the one with the highest score. The four values are plotted in strips labelled H, E, T and C. Below, a strip labelled D for decision is divided into four levels, each corresponding to one of the four structure types. Their top to bottom order is the same as that for the strips above, i.e. C, T, E, and H. For each residue the program determines which of the four likelihoods is highest. It places a single dot at the mid-point of the corresponding strip, and also at the appropriate level in the strip labelled D. It should be noted that the method, when tested by Kabsch W and Sander C (1983) FEBS Lett. 155, 179-182, although one of the better ones, was correct for only about 56% of residues. Typical dialogue follows.
? Menu or option number=d24
Plot Robson secondary structure predictions
? (y/n) (y) Plot results n
9 S 217 -7 -39 15
10 E 226 5 -27 -39
11 L 233 -7 -26 -15
12 I 229 -23 9 4
13 K 214 -8 10 -8
14 Q 178 42 19 5
15 R 131 54 16 3
16 I 86 42 -31 -23
17 A 55 52 -30 -15
18 Q 15 67 4 25
19 F -34 86 47 74
20 N -41 74 17 106
21 V -16 118 -5 100
22 V 64 88 5 115
23 S 96 38 26 155
24 E 133 -25 13 96
25 A 118 -98 25 100
26 H 110 -150 37 86
27 N 57 -201 37 66
28 E 51 -140 11 -4
29 G 2 -77 37 9
30 T 2 28 28 7
31 I -11 117 -21 22
32 V -23 178 -55 5
33 S -54 193 -14 35
34 V -46 123 5 30
35 S -54 53 51 80
36 D -60 1 86 55
37 G -66 8 57 49
38 V -1 128 -30 -5
39 I 11 212 -56 -33
40 R 16 204 -44 -57
...etc
@26. TX 4 @Draw a helix wheel A helical representation of segments of the sequence is shown.
The display includes a schematic of the helix showing the links between residues, with each vertex numbered according to position; the sequence element at each vertex; a symbol denoting a classification as hydrophobic(.), positively charged(+), negatively charged(-), or otherwise( ). The residue number of the first sequence element in the current window is displayed at the top-left-hand corner of the diagram. Also at the top-left corner the sequence in the current window is listed. Below this is the total hydrophobicity and hydrophobic moment for the window calculated according to Eisenberg et al., J. Mol. Biol. 179, 125-142 (1984). If dialogue is requested the user is asked for the angle to define the turn between residues as seen looking along the helix, and a window length. The window length can be up to 60, with default 18, and the angle has a default of 100 degrees. Note that 18 x 100 degrees is 1800 degrees, i.e. 5 complete turns. When the option is selected the first segment in the current active region is displayed, then the bell rings. If the user types only return, the display will click on by one residue; if another number is typed, say N, then the display will click forwards (or backwards if N is negative) by N residues. If the wheel runs off either end of the sequence the option will be exited. Typical dialogue follows. ? Menu or option number=d26 ? Angle (1-130) (100) = ? Window (1-60) (18) = missing graphics @25. TX 3 4 @Plot hydrophobic moment This routine plots hydrophobic moment and hydrophobicity according to Eisenberg et al., J. Mol. Biol. 179, 125-142 (1984). The mean hydrophobicity per residue in the window is plotted on a scale -1.0 to 1.5, and the mean hydrophobic moment per residue on a scale 0.0 to 1.5. The hydrophobicity is shown in the top frame with the hydrophobic moment below. The plot is arranged so that the value shown at position x represents the mean value for residues x-window+1 to x, where window is the window length.
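The windowed calculation can be sketched as follows. This is a minimal Python illustration, assuming the usual definition of the hydrophobic moment in which each residue's hydrophobicity is treated as a vector rotated by the chosen angle per residue. The function name and the two-letter toy scale used in the usage note are hypothetical; real use would need a full published hydrophobicity scale.

```python
import math


def hydrophobic_moment_profile(seq, scale, window=18, angle_deg=100.0):
    """Return (position, mean hydrophobicity, mean hydrophobic moment) per window.

    `position` is the 1-based index x of the last residue in the window, so each
    value summarises residues x-window+1 to x.
    """
    delta = math.radians(angle_deg)
    values = [scale[aa] for aa in seq]
    profile = []
    for end in range(window, len(values) + 1):
        segment = values[end - window:end]
        mean_h = sum(segment) / window
        # Resolve each residue's hydrophobicity into components around the helix axis.
        sin_sum = sum(h * math.sin(n * delta) for n, h in enumerate(segment))
        cos_sum = sum(h * math.cos(n * delta) for n, h in enumerate(segment))
        profile.append((end, mean_h, math.hypot(sin_sum, cos_sum) / window))
    return profile
```

With a toy scale such as `{"A": 1.0, "K": -1.0}`, a window of 18 identical residues at 100 degrees gives a moment of essentially zero, because the 18 vectors are spread evenly over 5 complete turns, whereas a strictly alternating sequence analysed at 180 degrees gives a mean moment of 1.0 with a mean hydrophobicity of 0.0.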
If dialogue is requested the user can select a window length, and the angle used for the hydrophobic moment calculation. Note that according to Eisenberg et al., in transmembrane proteins an "initiator" is required. This is either a very hydrophobic single helix with mean hydrophobicity >= 0.68, or a moderately hydrophobic pair of helices whose mean hydrophobicities sum to >= 1.1. Other helices are then accepted as transmembrane if their mean hydrophobicity is >= 0.42. The following rules are claimed: if the mean hydrophobicity is < 0.51 and points lie below the line mean hydrophobic moment = -0.392 + 0.603x they are "globular", if they lie above this line they are "surface". If the mean hydrophobicity is > 0.51 and they lie above the line mean hydrophobic moment = 0.6 - 0.342x they are "monomeric", if above "multimeric". Typical dialogue follows. ? Menu or option number=d25 ? Angle (1-130) (100) = ? Window (1-60) (18) = ? Plot interval (1-101) (3) = missing graphics @27. TX 1 @Back translate to dna This routine back translates protein sequences into DNA using the standard genetic code. The level of redundancy can be plotted and the back translation saved to a file. The translation can use either the IUB symbols shown below, or a set of codon preferences. If a set of codon preferences is used it must conform to the format of codon tables produced by the nucleotide analysis program, and the back translation will contain the favoured codons. If there is no favoured codon the IUB symbols will be employed. The window length for plotting the redundancy is in codons. The program will plot the redundancy along the sequence and hence can be used to find the best sequences to use as primers. Note that the program plots the inverse, and so the higher the plot the LESS redundant the sequence. For primers look for peaks rather than troughs. The DNA sequence can be saved to a file and analysed using the nucleotide analysis program. Depending on the application it is often useful to produce two back translations, one using a table of codon preferences and one using the IUB symbols.
This is because the restriction enzyme search program can distinguish between definite and possible cuts in the sequence. Definite cuts are what the program terms "definite matches": ones in which the specification of the recognition sequence corresponds exactly to that of the back translation. The program will also find what it terms "possible matches", which are ones that depend on the particular codons chosen for each amino acid. These are sites at which recognition sequences could be engineered to produce a cut in the DNA without changing the amino acid, but which are not necessarily found in the original sequence.

NC-IUB SYMBOLS
A,C,G,T
R (A,G) 'puRine'
Y (T,C) 'pYrimidine'
W (A,T) 'Weak'
S (C,G) 'Strong'
M (A,C) 'aMino'
K (G,T) 'Keto'
H (A,T,C) 'not G'
B (G,C,T) 'not A'
V (G,A,C) 'not T'
D (G,A,T) 'not C'
N (G,A,C,T) 'aNy'

Typical dialogue follows. ? Menu or option number=d27 Back translate ? (y/n) (y) No codon preference ? (y/n) (y) Plot redundancy n ? (y/n) (y) Save DNA to disk ?
File name for DNA sequence=tt: ATGCARYTNAAYWSNACNGARATHWSNGARYTNATHAARCARMGNATHGCNCARTTYAAY GTNGTNWSNGARGCNCAYAAYGARGGNACNATHGTNWSNGTNWSNGAYGGNGTNATHMGN ATHCAYGGNYTNGCNGAYTGYATGCARGGNGARATGATHWSNYTNCCNGGNAAYMGNTAY GCNATHGCNYTNAAYYTNGARMGNGAYWSNGTNGGNGCNGTNGTNATGGGNCCNTAYGCN GAYYTNGCNGARGGNATGAARGTNAARTGYACNGGNMGNATHYTNGARGTNCCNGTNGGN MGNGGNYTNYTNGGNMGNGTNGTNAAYACNYTNGGNGCNCCNATHGAYGGNAARGGNCCN YTNGAYCAYGAYGGNTTYWSNGCNGTNGARGCNATHGCNCCNGGNGTNATHGARMGNCAR WSNGTNGAYCARCCNGTNCARACNGGNTAYAARGCNGTNGAYWSNATGATHCCNATHGGN MGNGGNCARMGNGARYTNATHATHGGNGAYMGNCARACNGGNAARACNGCNYTNGCNATH GAYGCNATHATHAAYCARMGNGAYWSNGGNATHAARTGYATHTAYGTNGCNATHGGNCAR AARGCNWSNACNATHWSNAAYGTNGTNMGNAARYTNGARGARCAYGGNGCNYTNGCNAAY ACNATHGTNGTNGTNGCNACNGCNWSNGARWSNGCNGCNYTNCARTAYYTNGCNMGNATG CCNGTNGCNYTNATGGGNGARTAYTTYMGNGAYMGNGGNGARGAYGCNYTNATHATHTAY GAYGAYYTNWSNAARCARGCNGTNGCNTAYMGNCARATHWSNYTNYTNYTNMGNMGNCCN CCNGGNMGNGARGCNTTYCCNGGNGAYGTNTTYTAYYTNCAYWSNMGNYTNYTNGARMGN GCNGCNMGNGTNAAYGCNGARTAYGTNGARGCNTTYACNAARGGNGARGTNAARGGNAAR ACNGGNWSNYTNACNGCNYTNCCNATHATHGARACNCARGCNGGNGAYGTNWSNGCNTTY GTNCCNACNAAYGTNATHWSNATHACNGAYGGNCARATHTTYYTNGARACNAAYYTNTTY AAYGCNGGNATHMGNCCNGCNGTNAAYCCNGGNATHWSNGTNWSNMGNGTNGGNGGNGCN GCNCARACNAARATHATGAARAARYTNWSNGGNGGNATHMGNACNGCNYTNGCNCARTAY MGNGARYTNGCNGCNTTYWSNCARTTYGCNWSNGAYYTNGAYGAYGCNACNMGNAARCAR YTNGAYCAYGGNCARAARGTNACNGARYTNYTNAARCARAARCARTAYGCNCCNATGWSN GTNGCNCARCARWSNYTNGTNYTNTTYGCNGCNGARMGNGGNTAYYTNGCNGAYGTNGAR YTNWSNAARATHGGNWSNTTYGARGCNGCNYTNYTNGCNTAYGTNGAYMGNGAYCAYGCN CCNYTNATGCARGARATHAAYCARACNGGNGGNTAYAAYGAYGARATHGARGGNAARYTN AARGGNATHYTNGAYWSNTTYAARGCNACNCARWSNTGG--- @28. TX 5 @Search for patterns of motifs This option searches for patterns of motifs. Patterns can be defined interactively or read from files. Results can be displayed in several ways in both graphical and textual form. Used to create pattern files for searching libraries. The option is extremely flexible and consequently the following documentation is quite lengthy. 
However, the routine is capable of searching for almost any known pattern. In addition the flexibility does not necessitate difficulty of use, and the user interface has been simplified considerably since the methods were first published. Users should refer to the "typical dialogue" shown below for the most helpful information on using the program. There are currently four ways to display the matching patterns: 1 = each individual motif and its position is listed; 2 = all the sequence between, and including, the two outermost motifs is listed; 3 = graphical, with a vertical line marking the position of the leftmost motif; 4 = EMBL feature table format, where the KEYNAM field is the motif name, the FROM and TO fields denote the ends of the match, and the DESCRIPTION field is "Program". When it is defined for the first time a pattern must be entered interactively at the keyboard, but the pattern description can be saved to a file. This file can be used for all subsequent searches. When defining a pattern interactively select a motif class and the program will request the required inputs. The program gives each motif an identifying name and number. For motifs other than the first, a range of allowed positions must be defined (note that sets of motifs included using the OR operator will all be given the same range, and so the program will only request range values for the first motif in any such set). To specify the allowed range for a motif the user must supply the following: the identifying number of the motif relative to which the current motif's positions are to be defined (termed the "reference motif"); a "relative start position"; and a range. The relative start position can be negative or positive. A negative start position means that although the reference motif is searched for first, the current motif can be found to its left. A zero relative start position means their left ends are superimposed.
The default start position is to butt-join the motif to the right-hand end of the "reference motif". The range is "the number of extra positions" that the motif can take. The program will display the probability of finding each motif. These values are presented in the following form: .1234E-5 means 0.1234 times 10 to the power -5. After the pattern has been defined, the program will type a description of it on the screen. It will then allow the user to give an overall cutoff score and overall probability cutoff. Typical dialogue for all the different motif classes is displayed below. ? Menu or option number=28 Pattern searcher ? (y/n) (y) Read pattern from keyboard X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 = ? Motif name=aa ? String=aa Probability of score 2.0000 = 0.123E-01 X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =2 ? Motif name=pmatch X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-1) (1) = ? Relative start position (-1000-1000) (3) = ? Number of extra positions (0-1000) (0) = ? String=qqq ? Minimum matches (1.00-3.00) (3.00) =2 Probability of score 2.0000 = 0.858E-02 1 Exact match X 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =3 ? Motif name=sm X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-2) (2) = ? Relative start position (-1000-1000) (4) = ? Number of extra positions (0-1000) (0) = ? String=wqa ? Minimum score (11.00-53.00) (53.00) =36 Probability of score 36.0000 = 0.531E-02 1 Exact match 2 Percentage match X 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =4 ?
Motif name=hth X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-3) (3) = ? Relative start position (-1000-1000) (4) = ? Number of extra positions (0-1000) (0) = ? Weight matrix file name=hth HELIX TURN HELIX PABO SAUER WEIGHTS 17-11-87 Probability of score -51.5860 = 0.230E-04 1 Exact match 2 Percentage match 3 Cut-off score and score matrix X 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =5 ? Motif name=repeat X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-4) (4) = ? Relative start position (-1000-1000) (21) = ? Number of extra positions (0-1000) (0) =3 ? Repeat length (1-60) (6) =3 ? Minimum gap (0-60) (0) = ? Maximum gap (0-60) (0) =2 ? Minimum score (11.00-60.00) (36.00) = Probability of score 36.0000 = 0.445E-01 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix X 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =6 ? Motif name=mset X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-5) (5) = ? Relative start position (-1000-1000) (1) = ? Number of extra positions (0-1000) (0) = X 1 Keyboard input 2 File input ? 0,1,2 = Separate sets with commas ? String=AVL,AST,,WYRF ? Minimum matches (1.00-4.00) (4.00) =3 Probability of score 3.0000 = 0.718E-02 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat X 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =7 ? (y/n) (y) Save pattern in a file ? Pattern definition file=EXAM.PAT Motif 6 needs a file name to store set as a weight matrix ? Weight matrix file name=DEMO.WTS Weight matrix needs a title ? Title=Demonstration class 6 weight matrix Pattern description Motif 1 named aa is of class 1 Which is an exact match to the string aa Motif 2 named pmatch is of class 2 which is a match of score 2. 
to the string qqq and the N-terminal residue can take positions 3 to 3 relative to the N-terminal end of motif 1 It is anded with the previous motif. Motif 3 named sm is of class 3 which is a match of score 36. to the string wqa and the N-terminal residue can take positions 4 to 4 relative to the N-terminal end of motif 2 It is anded with the previous motif. Motif 4 named hth is of class 4 Which is a match to a weight matrix with score -51.586 and the N-terminal residue can take positions 4 to 4 relative to the N-terminal end of motif 3 It is anded with the previous motif. Motif 5 named repeat is of class 5 Which is a repeat with repeat length 3 and score 36. The loop-out can have sizes 0 to 2 and the N-terminal residue can take positions 21 to 24 relative to the N-terminal end of motif 4 It is anded with the previous motif. Motif 6 named mset is of class 6 Which is membership of a set with score 3.000 It is anded with the previous motif. Probability of finding pattern = 0.4109E-14 Expected number of matches = 0.2539E-10 ? Maximum pattern probability (0.00-1.00) (1.00) = ? Minimum pattern score (-9999.00-9999.00) (-9999.00) = Select display mode X 1 Motif by motif 2 Inclusive 3 Graphical 4 EMBL feature table ? 0,1,2,3,4 = Searching Total matches found 0 Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structure m5 = Search ? = Help ! = Quit ? Menu or option number=6 Page through text files ? 
Name of file to read=exam.pat A1 aa Class aa @ End of string A2 pmatch Class 1 Relative motif 3 Relative start position 0 Number of extra positions qqq @ End of string 2.00000 Cutoff A3 sm Class 2 Relative motif 4 Relative start position 0 Number of extra positions wqa @ End of string 36.00000 Cutoff A4 hth Class 3 Relative motif 4 Relative start position 0 Number of extra positions hth File name A5 repeat Class 4 Relative motif 21 Relative start position 3 Number of extra positions 3 Length 0 Minimum loop 2 Maximum loop 36.00000 Cutoff A6 mset Class 5 Relative motif 1 Relative start position 0 Number of extra positions DEMO.WTS File name End of file Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structure m5 = Search ? = Help ! = Quit ? Menu or option number=6 Page through text files ? Name of file to read=demo.wts Demonstration class 6 weight matrix 4 0 3.000 4.000 P 1 2 3 4 N 0 0 0 0 C 0 0 0 0 S 0 1 0 0 T 0 1 0 0 P 0 0 0 0 A 1 1 0 0 G 0 0 0 0 N 0 0 0 0 D 0 0 0 0 E 0 0 0 0 Q 0 0 0 0 B 0 0 0 0 Z 0 0 0 0 H 0 0 0 0 R 0 0 0 1 K 0 0 0 0 M 0 0 0 0 I 0 0 0 0 L 1 0 0 0 V 1 0 0 0 F 0 0 0 1 Y 0 0 0 1 W 0 0 0 1 End of file Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structure m5 = Search ? = Help ! = Quit ? Menu or option number=28 Pattern searcher ? (y/n) (y) Read pattern from keyboard X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =2 ? Motif name=avlst ? String=avlst ? Minimum matches (1.00-5.00) (5.00) =3 Probability of score 3.0000 = 0.394E-02 1 Exact match X 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Direct repeat 6 Membership of set 7 Pattern complete ? 0,1,2,3,4,5,6,7 =7 ? 
(y/n) (y) Save pattern in a file n Pattern description Motif 1 named avlst is of class 2 which is a match of score 3. to the string avlst Probability of finding pattern = 0.3941E-02 Expected number of matches = 0.2030E+01 ? Maximum pattern probability (0.00-1.00) (1.00) = ? Minimum pattern score (-9999.00-9999.00) (-9999.00) = Select display mode X 1 Motif by motif 2 Inclusive 3 Graphical 4 EMBL feature table ? 0,1,2,3,4 =4 Searching FT avlst 152 156 Program Total matches found 1 Minimum and maximum observed scores 3.00 3.00 General notes These methods allow users to define and search for complex patterns of motifs defined as single objects. The programs allow individual DNA motifs to be defined in eight different ways, and protein motifs in six. Motifs are combined, using the logical operators AND, OR and NOT, to describe a pattern. The pattern also specifies the ranges of allowed relative separations of the individual motifs. First some definitions. A MOTIF is a contiguous subsequence of fixed length. At its simplest it could be a single definite base or amino acid; a more complex motif might be better represented as a consensus or a weight matrix; two more-abstract types of motif are direct and inverted repeats. A PATTERN is a higher order of structure defined by a list of motifs. The motifs in a pattern are combined using the logical operators AND, OR and NOT. The list also defines the allowed relative separations of the motifs. In the current versions of the programs up to 50 motifs can be combined into a single pattern. So using these definitions there are two differences between motifs and patterns: 1) the distances between all elements of a motif are fixed, but the separations of parts of patterns can vary; 2) all characters in a motif are defined using the same method (class), but different parts of a pattern can be defined in completely different ways. 
Each motif can be represented in 9 ways (known as the motif class):

MOTIF CLASSES

CLASS DESCRIPTION
1 Exact match to a short defined sequence. The IUB symbols can be used for DNA sequences.
2 Percentage match to a defined short sequence. In nucleic acids, the IUB symbols can be used.
3 Match to a defined sequence, using a score matrix and cutoff score. The DNA matrix (see option 18) gives scores to IUB symbols depending on their level of redundancy. MDM78 is used for proteins.
4 Match to a weight matrix with cutoff score.
5 As class 4 but on the complementary strand.
6 Inverted repeat or stem-loop. Fixed stem length, range of loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
7 Exact match to short sequence but with a defined step size.
8 Direct repeat. Fixed repeat length, range of loop-out sizes, cutoff score, and score matrix (for protein sequences MDM78 and for nucleic acids an identity matrix).
9 Membership of a set. A list of sets of allowed amino acids for each position in the motif. The sets are separated by commas (,). For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino acids in which one of I,V or L must be found in the first position, then anything in the next two positions, D,E,K or R in the fourth position and F,Y,W,I,L,V or M in the fifth. This class only applies to protein sequences because for nucleic acids "membership of a set" can be achieved using IUB symbols.

Classes 1-4, 8 and 9 apply to protein sequences, and classes 1-8 to nucleic acids.

Class 1: exact match
The motif is defined by a short sequence, which for nucleic acids may include IUB symbols. All symbols must match.

Class 2: percentage match
The motif is defined by a short sequence, which for nucleic acids may include IUB symbols. The minimum number of matching characters must also be specified.

Class 3: match using a score matrix
The motif is defined by a short sequence, which for nucleic acids may include IUB symbols.
The motif is not compared directly with the sequence to count the number of matching characters. Instead a matrix is used to provide a score for all possible pairs of characters. The motif score for any position along the sequence is the sum of the scores found by looking up the scores for each pair of aligned characters. A match is declared if some minimum score is achieved.

Class 4: weight matrix
The motif is defined by a table of values (called weights or scores). The table gives a score for finding each possible character at each position along the length of the motif. It therefore has dimension motif-length x character-set-size, and allows us to give different scores for each character at each position. It is equivalent to having a different score matrix for each position along the motif, and provides the most flexible and specific method of defining motifs. The weight matrices are created by program PIP option 20 and stored as files. The file contains the values for each position, as well as an overall minimum score. There are two ways in which these values can be used to calculate an overall score for any section of the sequence. The simplest way is to add the values in the file. (This means that the highest possible score can be calculated by adding the top value at each column position, and the lowest by adding the bottom value.) The normal way of using the values in the file is as follows. First the programs divide the values in each column by the column total so that they sum to 1.0. Then the natural logs of these values are used as scores. When the matrix is applied to a sequence these logarithmic values are summed (which is of course equivalent to multiplying the frequencies). Note that using the natural logs of the frequencies as weights and adding them means that the overall cutoff score must be less than zero, whereas if the original values in the weight matrix file are added, the cutoff score will be greater than zero.
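The two ways of using the file values can be sketched in Python as below, with the mode chosen from the sign of the cutoff score. The names `to_log_frequencies` and `search` are hypothetical, the matrix is assumed to be a list of per-position dictionaries holding positive values for every character, and a cutoff of exactly zero is treated here as the logarithmic case, which is an assumption rather than documented behaviour.

```python
import math


def to_log_frequencies(matrix):
    """Divide each column by its total and take natural logs."""
    table = []
    for column in matrix:
        total = sum(column.values())
        table.append({ch: math.log(v / total) for ch, v in column.items()})
    return table


def search(sequence, matrix, cutoff):
    """Return (1-based start, score) for every window scoring at least `cutoff`."""
    # Positive cutoff: add the file values directly.
    # Otherwise: add logs of column frequencies (equivalent to multiplying them).
    table = matrix if cutoff > 0 else to_log_frequencies(matrix)
    width = len(table)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(col[ch] for col, ch in zip(table, window))
        if total >= cutoff:
            hits.append((start + 1, total))
    return hits
```

With a two-position matrix of 0/1 values and a cutoff of 2 the first branch counts matching positions, while the same routine given counts and a negative cutoff scores each window by its summed log frequencies.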
The search routines therefore decide whether the user wants to add values or multiply frequencies by examining the value of the cutoff score: it will add if the cutoff is greater than zero and add logs of frequencies if it is less than zero. Hence we effectively get two motif classes in one. The program PIP, when creating weight matrix files, will ask the user whether the scores should be added or multiplied. If the values in the table have been defined without using a set of aligned sequences it is easier for the user to choose a cutoff score if the values are added. Class 5: complement of weight matrix The motif is defined by a weight matrix, but the program searches for its complement. Class 6: inverted repeat, or stem-loop The motif is defined by a repeat length, a minimum score and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0. The loop sizes are defined by a minimum and maximum distance from the 3' end of the stem. For a stem-loop these will be positive numbers. For example to define a stem of length 8 and loop sizes varying from 3 to 5, the stem would be set to 8, the minimum start distance to 3 and the maximum to 5. To define an inverted repeat the minimum distance will be negative. For example stem length=9, minimum distance=-9, and maximum distance=-8 will find inverted repeats of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base at its centre, the second having none. Class 7: exact match, defined step size. The motif is defined by a short sequence, which for nucleic acids, may include IUB symbols. All symbols must match. The class differs from class 1 in that searches will move in steps of some given size. For example we could search for a certain codon and use a step size of 3 and hence keep in a single reading frame. Class 8: direct repeat The motif is defined by a repeat length, a minimum score and a range of loop sizes. 
The scores are defined using MDM78 for protein sequences and an identity matrix for nucleic acids. The loop sizes are defined by a minimum and maximum distance from the 3' end of the stem. Class 9: membership of a set This motif class is for protein sequences. It is defined by lists of allowed amino acids for each position in the motif, and a cut-off score. Positions at which any amino acid can occur are left blank. All allowed amino acids for each position give a score of 1. The motifs can be defined in two ways: either typed at the keyboard or read in as a weight-matrix-like file. When the motif is defined at the keyboard the sets of allowed amino acids are separated by commas(,). For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino acids in which one of I,V or L must be found in the first position, then anything in the next two positions, D,E,K or R in the fourth position and F,Y,W,I,L,V or M in the fifth. To specify that the whole motif must match a score of 3 would be required (i.e. one of the allowed amino acids must be found for each of the three defined positions). If the motif is read from a file the file must have been written by program PIP, or have been saved by the pattern searching routines. If the user elects to save a pattern, and it includes class 9 motifs typed at the keyboard, then the program will save the class 9 motifs as weight matrix files. Therefore it will request file names for each motif of this class. If the motif given above as an example were saved the weight matrix file would have 5 columns. The first column would contain zeroes except for the I, V and L rows which would be set to 1; the next two columns would all be zero; the next would be zero except for the D,E,K and R rows which would be 1; the final column would contain 1's in rows F,Y,W,I,L,V and M, with the rest zero. The logical operator (AND, OR or NOT) used to add each motif to the pattern is specified by preceding the class number by the letters A, O or N. 
A = AND, O = OR, N = NOT. The default is A, so N2 means include, using the NOT operator, a class 2 motif; O2 means include, using the OR operator, a class 2 motif; both A2 and 2 mean include, using the AND operator, a class 2 motif.

Range setting.

The motifs in a pattern are numbered according to their order in the list. Apart from the first motif in a pattern all motifs are given a range of allowed positions relative to a motif further up the list. For example suppose we have a pattern defined by A AND B AND C AND D. Motif A can occur anywhere, but B must have its range of allowed positions defined relative to the position of motif A, and C's positions can be defined relative to either A or B, depending on which is most convenient, and likewise D's positions can be relative to A or B or C.

Notice that the positions of motifs can be defined relative to more than one motif. Suppose we have a pattern consisting of motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-10 residues right of B, and also C is never more than 15 residues from A. Then it is quite consistent with the methods to include motif C into the pattern twice using the AND operator: once relative to A and once relative to B. This will define the relative spacing and the ORDER of the motifs in the pattern. (If we simply defined the position of C relative to A it could be found to the left of B).

Motifs combined together using the OR operator are all given the same range. For example suppose we had a pattern A AND (B OR C) AND (D OR E), then B and C each have the same range, and D and E also have the same range as one another. The range for D and E can be relative to A or to B.

Motifs cannot have their ranges defined relative to motifs that are included using the NOT operator. For example if we had the pattern A NOT B AND C, then the range for C can only be defined relative to motif A.
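The effect of including motif C twice (once relative to A and once relative to B) can be illustrated with a small Python sketch. This is not the program's own code: the function names, the use of start-to-start offsets as the distance measure, and the lower bound of 0 for "never more than 15 residues from A" are all assumptions made for illustration.

```python
from itertools import product


def in_range(ref_pos, pos, lo, hi):
    """Return True if pos lies lo..hi residues to the right of ref_pos.

    Offsets are measured start-to-start (an assumption of this sketch).
    """
    return lo <= pos - ref_pos <= hi


def find_pattern(a_hits, b_hits, c_hits):
    """Combine hit positions for motifs A, B and C into full pattern matches."""
    matches = []
    for a, b, c in product(a_hits, b_hits, c_hits):
        if (in_range(a, b, 5, 10)            # B is 5-10 residues right of A
                and in_range(b, c, 5, 10)    # C is 5-10 residues right of B
                and in_range(a, c, 0, 15)):  # C again: at most 15 from A
            matches.append((a, b, c))
    return matches


# Hypothetical hit positions for each motif in one sequence.
example = find_pattern([100], [106, 110], [113, 116, 118])
```

With these hypothetical hits only the combination A=100, B=106, C=113 survives: C=116 satisfies the range relative to B but is 16 residues from A, so the second inclusion of C rejects it.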
Speed can be gained by arranging the order of the motifs so that those higher up the list are of types that can be searched for rapidly and that are also unlikely to be found. Motifs combined by the OR operator are alternatives: if any one of a set of motifs combined by the OR operator is found, then a match is declared. All alternatives will be reported. For example if we had a pattern defined by A AND (B OR C), then all places where A occurs and B is found within range, and all places where A is found and C is found within range will be reported. A typical use would be where we might allow a motif to appear on either strand of the DNA sequence. For example a weight matrix representing the heatshock element could be used in a pattern which included heatshock as a motif class 4 combined using the OR operator with heatshock as a motif class 5. The probability calculations are performed for each motif as it is defined. If an overall probability cut-off is given the calculation is repeated for each match found. To achieve maximum searching speed do not give an overall probability cut-off. Overall cut-off scores should only be used if the motif classes used are compatible. There are currently several ways to display the matches: 1 = each motif and its position is listed; 2 = all the sequence between the two outermost motifs is listed; 3 = graphical, with a spike marking the position of the leftmost motif. The library versions also give entry names, and a one line title; in addition they can be used to produce aligned families of sequences. When this mode of output is selected the program will write a separate file for each match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the name of the entry in the library. The matching sequence will be written out so that the spacing between motifs is constant, and set to the maximum allowed by the pattern definition. Any gaps will be filled with dashes (-). 
If the individual sequences were subsequently written one above the other they should line up so that all motifs are in register. There are two types of output of this sort: one, option 4, writes out whole sequences; the other, option 5, writes out only the sequences between the two outermost motifs. Note that for option 4 users are asked to type the position of the first motif, and the reason for this is explained below. Consider a pattern found in several sequences. Consider only the first motif in the pattern and suppose that it was found in different positions in these sequences. Say that of these positions the one furthest from the left end was position 100. Then, in order to ensure that all the sequences would align, we must specify that motif 1 must start at position 100. Any sequences in which motif 1 started nearer to the left end than position 100 would be padded accordingly. These modes of output should only be used when the position of each motif is defined relative to its immediate neighbour.

The pattern descriptions can be saved to files. These files can be used instead of typing definitions again at the keyboard. As the files are annotated, they can easily be changed using system editors, and the modified versions used to define the variant patterns for the programs.

Use of lists of entry names

The two programs that operate on libraries have the ability to restrict their searches to subsets of the libraries. This does not require sublibraries to be created but instead is achieved by using files containing a list of the entry names of sequences.
The user may choose to search only those entries on the list or, alternatively, to search all but those on the list (i.e. in the latter case the list contains the names of those to be excluded). The programs can search libraries that have indexes and those that do not. If a list of names for inclusion is used, then the search will be faster if the index is present. In all other circumstances the whole library will be read. The list must be in library order except when it is used to include entries, and an index is available. The list must contain each entry name on a separate line, with the name starting in column 1 of the line, i.e. there must be no spaces at the start of the line. The list of entry names can be produced by the keyword searches of nip, pip, etc as long as the listings produced have a space character separating the entry name from the entry description. This will depend on how well the library reformatting programs work. For example swissprot entry names tend to run into the beginning of the descriptions, but other libraries are generally OK.

One use of the programs is to look for patterns that we already know about, but in new sequences. However it is hoped that they will also be useful for finding new motifs. For example several known control regions in nucleic acid sequences consist of particular direct or inverted repeats; the inclusion of direct and inverted repeats as motif classes makes it possible to find previously unknown motifs of these types. Using these new programs we can ask questions like: "are there any inverted or direct repeats near to sections of sequence that contain both a CCAAT box and a TATA box?"; and to search for such things throughout the libraries. In addition, the mode of output in which all the sequence between the two outermost motifs found is printed out, allows us to extract sequences and examine them in more detail for further common subsequences.
For example we might want to collect together all the sequences between putative CCAAT and TATA boxes.

A further use of the inverted repeat motif class is the following. If a regulatory sequence in DNA is poorly defined but also an inverted repeat, then it might be an advantage to specify it both as a consensus sequence and a superimposed inverted repeat. In this way two weak definitions can be combined to produce a stronger pattern.

Given only a few examples of a motif it should be possible to perform initial searches using a class 3 motif, and then, using plausible matching sequences, create a more specific weight matrix for the same motif.

If motifs are combined with the first motif using the OR operator they will be ignored until all permutations that include the first motif have been looked for. The whole search will then be repeated, in turn, for each of those motifs that are combined with the first motif using the OR operator. An interesting consequence of this is that the program can be used, without change, to compare any newly determined sequence with all known individual motifs. We achieve this by having a pattern in which all known relevant motifs are combined using the OR operator. If we ask to use this pattern with a sequence, the program will automatically compare each individual motif in the pattern with the whole length of the sequence. As the number of known motifs grows this should become an increasingly useful standard procedure.

The NOT operator is obviously useful for making sure particular motifs are not present, but it can also be used to bracket the levels of matches found. We may want a degree of match that lies between two limits - binding should occur, but not too strongly; or base-pairs should form, but not too many. We can specify this by asking for a match with a low score, in combination with a match with a high score, both for the same motif, but with the high score included using the NOT operator.
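The bracketing idea can be pictured with a minimal Python sketch. It is only an illustration of the logic "motif with a low cutoff AND NOT the same motif with a high cutoff"; the function name, the list of per-position scores and the 0-based positions are assumptions, not part of the program.

```python
def bracketed_hits(scores, low_cutoff, high_cutoff):
    """Return positions whose score reaches low_cutoff but not high_cutoff.

    scores[i] is taken to be the score of the motif placed at position i
    (a hypothetical input; positions here are 0-based).
    """
    hits = []
    for position, score in enumerate(scores):
        reaches_low = score >= low_cutoff    # the motif with the low cutoff
        reaches_high = score >= high_cutoff  # the same motif, high cutoff
        if reaches_low and not reaches_high:  # low AND NOT high
            hits.append(position)
    return hits


# Hypothetical scores along a short sequence.
example = bracketed_hits([3, 7, 12, 9, 15], low_cutoff=7, high_cutoff=12)
```

Here only positions 1 and 3 (scores 7 and 9) are kept; the scores of 12 and 15 are rejected because they also satisfy the high cutoff.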
The algorithm is designed to find all sections of a sequence that satisfy the pattern rather than only the best match. Particularly if some of the motifs in a pattern are less well defined than others, this can often result in the same region of a sequence being reported as having several matches, but which only vary in the positions of the weakest motifs.

General remarks on motif searching

Generally motifs are short subsequences that are thought to be associated with particular functions in some known sequences. Often we search for them to try to understand or interpret other sequences. Sometimes we search for motifs and patterns to test a hypothesis about their role: are they found in the expected positions in the expected sequences? In doing so we should remember that, in both proteins and nucleic acids, what we are really looking for is a particular three dimensional structure with certain affinities for other structures, and that we are assuming that the sequence of the motif alone defines the 3D structure we are searching for. The overall structure may be completely different to those in which the motif is functional, and hence the motif may have a different shape or be inaccessible.

We should be aware of the importance of the context in which a motif is found. Where does it lie relative to the overall structure, is it accessible, is the three dimensional spacing between it and other motifs correct? For example, is it on the same side of the double helix, and the correct distance from some other motif? How does context affect our assessment of the significance of finding a motif? Finding false mammalian mRNA splice junctions in non-coding sequences is far less important than finding false sites in pre-mRNA sequences, but finding them in the correct places is most important!
In other words, it is often the case that when we are searching for a motif that is known to be necessary for some function, then a positive result in the form of a match in the required position, is more important than a high background of matches in the wrong positions. Being able to write down the probability of finding a motif in a random sequence tells us how well it is defined. In nucleic acids the DNA may contain many superimposed types of information such as those concerned with histone phasing, protein coding or mRNA secondary structure. These overlapping "codes" may interfere with one another causing matches to motifs to be poorer than expected. In general we will only have a limited number of examples of the motif and we do not know how representative they are. Sequences have superimposed functions: some parts may be of general structural importance and give rise to an overall framework, and other parts give specificity and hence are not common; we may want to use a set of aligned sequences to define a motif, but want to use only the framework positions. Alternatively we may want to pick out only those parts of a set of aligned sequences that give a particular property, and to ignore other similarities that are due to some other property and which could obscure the pattern we are interested in. It is possible to apply a mask to a set of aligned sequences in order to give weight to selected positions only. The ability to define a mask allows certain positions to be used in the motif and others to be ignored, and yet still permits the use of a set of aligned sequences to calculate weights. The mask is requested and applied by the program and results in the masked positions being zero in the weight matrix. The mask is defined in the following way. 
Suppose we had a motif of length 15, then the mask x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is the dashes (-) that are significant and that positions 1,4,7,8,10,11,12,13,14 and 15 will be non-zero). Of course the same set of sequences could be used with several alternative masks in order to extract different features and create corresponding weight matrices.

The programs are described in Staden,R. CABIOS 4, 53-60, 1988; Staden,R. CABIOS 5, 89-96, 1989, and a forthcoming Methods in Enzymology.
@ end of help