.NPA
.SP 1
.left margin1
@-1. TX 0 @General
.sp
@-2. T 0 @Screen control
.sp
@-2. X 0 @Screen
.sp
@-3. T 0 @Statistical analysis of content
.sp
@-3. X 0 @Statistics
.sp
@-4. T 0 @Structures and repeats
.sp
@-4. X 0 @Structures
.sp
@-5. TX 0 @Translation and codons
.sp
@-6. TX 0 @Gene search by content
.sp
@-7. TX 0 @General signals
.sp
@-8. TX 0 @Specific signals
.sp
@0. TX -1 @NIP
.PARA
.para
This is a program for analysing individual nucleotide sequences. It can
read sequences stored in many of the most commonly used formats, and
performs all of the usual simple analyses. However, the main purpose of
the program is to provide methods for finding the function of each
section of a sequence. In general no single method can give an
unequivocal interpretation of a sequence, so we need to use many
techniques together and to combine their results. For this reason the
program presents many of its results graphically.
.para
General information is contained in the user interface. Online
documentation for any function follows a consistent pattern: summary,
list of inputs, list of outputs, details, example.
.LEFT MARGIN1
@1. TX 0 @ Help
.LEFT MARGIN2
.para
This option gives online help. The user should select option numbers and
the current documentation will be given. Note that option 0 gives an
introduction to the program, and that ? will get help from anywhere in
the program. The following functions are included:
.left margin1
@2. TX 0 @ Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @ Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through
annotations, or search sequence libraries for keywords. Sequences can be
read from "personal" sequence files or from sequence libraries. These
are referred to as the sequence "source". Personal files can be stored
in several formats: Staden, PIR, EMBL, GENBANK and GCG.
At LMB we use "Staden" format for sequencing and all the libraries are stored in their original formats. Note, however, that libraries such as EMBL or GenBank that are divided into several files (eg GenBank has 13 separate files) are indexed as a whole. This means that users do not need to know which file contains an entry, only which library. When the user selects to read in a sequence the program first asks for the sequence "source". .para If the user selects "personal" the program will ask for the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of the file. For PIR format the user will also be required to know the entry name of the sequence as the file can contain several. For the other formats only a single entry is expected. The file will be read, its length and composition will be displayed and the option left. .para If the user selects "library" as the sequence source the program will display a list of available libraries. The programs are capable of handling all current libraries but which ones are available will vary from site to site. At LMB we have several libraries and also weekly updates of data gathered between releases. The program will ask users to select a library and then give a list of options: .lit X 1 Get a sequence 2 Get annotations 3 Get entrynames from accession numbers 4 Search titles for keywords 5 Search text index for keywords .end lit If get a sequence or get annotations is selected users will be asked to type the entry name. The option will be left when a sequence is selected or ! is typed. The composition and length will be displayed. .para The text index contains all words from feature tables, reference titles, definition lines, keywords lists and comments, so the text index search is most useful. It is also the fastest. Up to 5 words can be searched for at once. The words should be typed separated by spaces, for example .lit ? 
Keywords=P53 mouse murine tumo .end lit will search for all entries that contain words starting with p53, mouse, murine and tumo. Only the unique entries that contain ALL words will be listed. Before listing the matching entries the program will show the number of 'hits' for each word and ring the bell. Escape is possible at this point, or after each screenfull of entries. In addition to the entry names the text search displays the primary accession number, the sequence length and up to 80 characters of description. (The search of 'titles' is now redundant because the full text index contains all the title words and the search is much faster. It will probably be removed from the program.) All searches are independent of case. Where possible the program will offer default entry names. .para Typical dialogue follows. .lit Select sequence source X 1 Personal file 2 Sequence library ? Selection (1-2) (1) = Select sequence file format X 1 Staden 2 EMBL 3 GenBank 4 PIR 5 GCG ? Selection (1-5) (1) = ? Sequence file name=M13MP7.SEQ Contig title removed Sequence length= 7238 Sequence composition T C A G - 2405. 1539. 1765. 1527. 2. 33.2% 21.3% 24.4% 21.1% 0.0% . . . Select sequence source X 1 Personal file 2 Sequence library ? Selection (1-2) (1) =2 Select a library X 1 EMBL 29 nucleotide library Dec 91 2 SWISSPROT 20 protein library Nov 91 3 PIR 31 protein library Dec 91 4 NRL3D 58 From Brookhaven protein library Dec 91 5 GenBank ? Selection (1-5) (1) = Library is in EMBL format with indexes Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =5 Search for keywords ? 
Keywords=P53 mouse P53 hits 68 MOUSE hits 8180 MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon MMANT11 X00885 562 Murine p53 gene 3' region with exon 11 MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end. MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds. MMP53 X00741 1377 Mouse mRNA for transformation associated protein MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53. MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11. MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8. MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4 MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53 MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53 MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5 22 different entries found Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =4 Search for keywords ? 
Keywords=alpha Searching for alpha AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank. ADHBADA1 102 duck alpha-d-globin gene, exon 1. ADHBADA2 1145 duck alpha-a-globin gene and 5' flank ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank. AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide. AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b AGAMYLS 120 synthetic alpha-amylase gene, 5' end. AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han ! Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =3 ? Accession number=v00636 Entry name LAMBDA Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) =2 Default Entry name=LAMBDA ? Entry name= ID LAMBDA standard; DNA; PHG; 48502 BP. XX AC V00636; J02459; M17233; X00906; XX DT 03-JUL-1991 (Rel. 28, Last updated, Version 3) DT 09-JUN-1982 (Rel. 1, Created) XX DE Genome of the bacteriophage lambda (Styloviridae). 
XX KW circular; coat protein; DNA binding protein; genome; KW origin of replication. XX OS Bacteriophage lambda OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae. XX RN [1] RP 1-48502 RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.; RT "Nucleotide sequence of bacteriophage lambda DNA"; RL J. Mol. Biol. 162:729-773(1982). XX ! Select a task X 1 Get a sequence 2 Get annotations 3 Get entry names from accession numbers 4 Search titles for keywords 5 Search text index for keywords ? Selection (1-5) (1) = Default Entry name=LAMBDA ? Entry name= DE Genome of the bacteriophage lambda (Styloviridae). Sequence length 48502 Sequence composition T C A G - 11988. 11360. 12336. 12818. 0. 24.7% 23.4% 25.4% 26.4% 0.0% .end lit .left margin1 @4. TX 1 @ Define active region .LEFT MARGIN2 .para For its analytic functions the program always works on a region of the sequence called the "active region". This function allows the start and end points of the active region to be reset. .para Define the required start and end points. .para When a new sequence is read into the program the active region is automatically set to start at the beginning of the sequence and extend to the maximum the program can handle. On most machines this will be to the end of the sequence. The positions are shown on the screen. Note that for convenience, in the listing and translation functions, the user is given access to regions outside the active region. .left margin1 @5. TX 1 @ List a sequence .LEFT MARGIN2 .para The sequence can be listed single or double stranded with line lengths from 10 to 120 in multiples of 10. .para Define the region to list, the line length required and choose between a single or double stranded display. 
The output looks like:
.lit
GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
        10         20         30         40         50

AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
        60         70         80         90        100

GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
       110        120        130        140        150

AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
       160        170        180        190        200
.end lit
.left margin1
@6. TX 1 @ List a text file.
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will
appear one page at a time.
.para
Supply the name of the file to be displayed.
.left margin1
@7. TX 1 @ Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a
file.
.para
Select redirection of either text or graphics, and supply the name of
the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again the
file will be closed and output will again appear on the screen.
.left margin1
@8. TX 1 @ Write active region to disk
.LEFT MARGIN2
.para
Used to write the current active section of sequence to a disk file in
"Staden format".
.para
Supply a file name and an optional title.
.para
The program has the capability of reading sequences stored in several
formats and so, in conjunction with this option, can be used to reformat
them.
.left margin1
@9. TX 1 @ Edit the sequence
.LEFT MARGIN2
.para
Used to edit sequences or any other files by giving access to the
computer's system editor. For editing sequences the input file should
have already been created using one of the listing functions such as
"list sequence", "list translation" or "list restriction sites above the
sequence".
.para
Supply the name of the file to edit.
Wait while the system editor is made ready (this can take a while on a
VAX). Use the editor. Exit from the editor. If a sequence has been
edited, and you want to process it, affirm that the sequence should be
"made active". The edited sequence will replace the original sequence.
.para
This editing method is designed to give users access to an editor with
which they are familiar - i.e. the one on their machine, and yet to
allow them to edit a sequence which contains all the landmarks they need
in order to know where they are. Users can create files containing
simple listings (single stranded) with numbering, using "list the
sequence", and then edit them with their system editor, using the
numbering to know where they are within the sequence. When the edits are
complete they exit from the editor and the program "analyses" the edited
file to extract only the sequence characters. Similarly a file
containing a three phase translation can be edited, or a file containing
a sequence plus its three phase translation, plus its restriction sites
marked above the sequence. In order to be able to "analyse" such
complicated listings and correctly extract the sequence the following
simple rule is used: all lines in the file that contain a character that
is not A,C,T,G or U are deleted. It is obviously important to be aware
of this rule and its implications.
.left margin1
@10. TX 2 @ Clear graphics
.LEFT MARGIN1
.para
Clears graphics from the screen.
.left margin1
@11. TX 2 @ Clear text
.LEFT MARGIN1
.para
Clears text from the screen.
.left margin1
@12. TX 2 @ Draw a ruler
.LEFT MARGIN2
.para
This option allows the user to draw a ruler or scale along the x axis of
the screen to help identify the coordinates of points of interest.
The user can define the position of the first base to be marked (for
example if the active region is 1501 to 8000, the user might wish to
mark every 1000th base starting at either 1501 or 2000 - it depends if
the user wishes to treat the active region as an independent unit with
its own numbering starting at its left edge, or as part of the whole
sequence). The user can also define the separation of the ticks on the
scale and their height. If required the labelling routine can be used to
add numbers to the ticks.
.left margin1
@13. TX 2 @ Use crosshair
.LEFT MARGIN2
.para
This function puts a steerable cross on the screen that can be used to
find the coordinates of points in the sequence. The user can move the
cross around using the directional keys; when the user hits the space
bar the program will print out the coordinates of the cross in sequence
units and the option will be exited.
.PARA
If instead, you hit a , the position will be displayed but the cross
will remain on the screen.
.PARA
If a letter s is hit the program will display the sequence around the
crosshair position, and leave the cross on the screen.
.left margin1
@14. TX 2 @ Reposition plots
.LEFT MARGIN2
.para
The position of each plot is defined relative to a user's drawing board
which has size 1-10,000 in x and 1-10,000 in y. Plots for each option
are drawn in a window defined by x0,y0 and xlength,ylength, where x0,y0
is the position of the bottom left hand corner of the window, xlength is
the width of the window and ylength is its height.
.lit
--------------------------------------------------------- 10,000
1                                                       1
1       --------------------------------------    ^     1
1       1                                    1    1     1
1       1                                    1    1     1
1       1                                    1 ylength  1
1       1                                    1    1     1
1       1                                    1    1     1
1       --------------------------------------    v     1
1  x0,y0^                                               1
1       <---------------xlength-------------->          1
--------------------------------------------------------- 1
1                                                  10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
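.para
The window geometry described above can be sketched in a few lines of
Python. This is only an illustration, not code taken from nip: the
Window type and both helper functions are hypothetical names, and the
linear mapping from sequence position to board x is an assumption, since
the text does not say how positions are scaled.
.lit
```python
from dataclasses import dataclass

# Drawing board limits stated in the text: 1-10,000 in both x and y.
BOARD_MIN = 1
BOARD_MAX = 10_000


@dataclass(frozen=True)
class Window:
    """A plot window in drawing board units (hypothetical helper type)."""
    x0: int       # x of the bottom left hand corner
    y0: int       # y of the bottom left hand corner
    xlength: int  # width of the window
    ylength: int  # height of the window

    def top_right(self):
        """Return the top right hand corner as (x, y)."""
        return (self.x0 + self.xlength, self.y0 + self.ylength)

    def fits_on_board(self):
        """True if the whole window lies inside the drawing board."""
        x1, y1 = self.top_right()
        return (self.xlength > 0 and self.ylength > 0
                and self.x0 >= BOARD_MIN and self.y0 >= BOARD_MIN
                and x1 <= BOARD_MAX and y1 <= BOARD_MAX)


def base_to_board_x(window, position, region_start, region_end):
    """Map a sequence position to a board x value.

    ASSUMPTION: the active region is stretched linearly across the
    window width; the text does not state the mapping nip uses.
    """
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    fraction = (position - region_start) / (region_end - region_start)
    return window.x0 + fraction * window.xlength


window = Window(x0=1000, y0=4000, xlength=8000, ylength=2000)
print(window.top_right())
print(window.fits_on_board())
print(base_to_board_x(window, 1501, 1501, 8000))
```
.end lit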
The default window positions are read from a file "NIPMARG" when the
program is started. Users can have their own file if required. As all
the plots start at the same position in x and have the same width, x0
and xlength are the same for all options. Generally users will only want
to change the start level of the window y0 and its height ylength. This
option allows users to change window positions whilst running the
program. The routine prompts first for the number of the option that the
user wishes to reposition; then for the y start and height; then for the
x start and length. Note that changes to the x values affect all
options. If the user types only carriage return for any value it will
remain unchanged. The cross-hair can be used to choose suitable heights.
.LEFT MARGIN1
@15. TX 2 @ Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to
finish typing the label the cross-hair appears on the screen. The user
can position it anywhere on the screen. If the user types R (for right
justify) the label will be written on the diagram with its right end at
the cross-hair position. If the user types L (for left justify) the
label will be written on the diagram with its left end at the cross-hair
position. The cross-hair will then immediately reappear. The user may
put the same label on another part of the diagram as before, or, on
hitting the space bar, will be asked whether another label is to be
typed in.
.para
Typical dialogue follows.
.lit
 ? Menu or option number=15
 Type label then drive cross hair to left or right end of label position then hit
 "L" to write label left justified or
 "R" to write label right justified or
 the space bar to quit
 ? Label=delta gene

 missing graphics

 ? Label=
.end lit
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
This draws a map of any sequence features selected by the user.
These features may be protein coding regions (CDS), tRNA genes (TRNA), promoter positions (PRM), etc. Users may define their own feature table key names. For example I find it convenient to split CDS lines into CDS1, CDS2 and CDS3 each of which contains only those sequences that code in the reading frames 1, 2 or 3. Then I can plot them at different heights on the screen ( suitable heights can be determined by using the cross-hair). .para The coordinates must be stored in a file in the format of an EMBL or GenBank feature table. Note that this means that the file must include either EMBL or GenBank headers, and a suitable "tail". The simplest header is the word FEATURES starting in column 1 of the first line of the file. The simplest tail is 2 empty lines at the end of the file. These lines are not included when nip writes out results in feature table format. .para Typical dialogue follows. .lit ? Menu or option number=16 Display a map using an EMBL feature table file ? map file name=hsegl1.ft ? feature code(e.g. CDS) =CDS X 1 + strand 2 - strand 3 both strands ? 0,1,2,3 = ? level (0-9480) (256) =4000 missing graphics ? feature code(e.g. CDS) = .end lit .left margin1 @17. TX 1 @ Search for restriction enzymes .LEFT MARGIN2 .para This routine is used to search for short sequences, like restriction enzyme recognition sequences, and can either list the results or present them graphically. Listings can take several forms and can include the sequence and its translation. Examples are given below. The program will also display the names of enzymes that cut the sequence infrequently. Users can select from sets of enzymes stored in files or can enter them from the keyboard. .para The short sequences (strings) and their names need to be arranged in a particular way. See below. Select to search, list an enzyme file or clear the screen. Choose either a file of enzymes or to enter their recognition sequences at the keyboard. 
Choose to search for all the enzymes in the list or to select from the list. Select a mode of output. Define the sequence as circular or linear. Select to search for "definite" or "possible" matches. The search starts, and after the results have been displayed, further searches can be performed. .para When the enzymes and their recognition sequences are stored in a file they must be defined in the following way. We call the recognition sequences "strings". The format is as follows: each string or set of strings must be preceded by a name, each string must be preceded and terminated with a slash (/), and each set of strings by 2 slashes. For example AATII/GACGT'C// defines the name AATII, its recognition sequence GACGTC and its cut site with the ' symbol; ACCI/GT'MKAC// defines the name ACCI and its recognition sequence includes IUB symbols for incompletely defined symbols in nucleic acid sequences; BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC// defines the name BBVI and this time two recognition sequences and cut sites are specified in order to correctly show the cutting position relative to the recognition sequence. If no cut site is included the first base of the recognition sequence is displayed as being on the 3' side of the recognition sequence. .para These collections of strings and their names can be read from disk or entered from the keyboard. When names and strings are entered from the keyboard the program will ask for the name and then the string(s). If more than one string is typed per name they must be separated by slash (/) characters. See the "Typical dialogue" below. Three files containing restriction enzyme recognition sequences are currently available. The "all enzymes" file contains the Rich Roberts REBASE restriction enzyme database, which is updated monthly. .para The user can select strings by name from these collections. If so the program will prompt for the names, one at a time. 
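.para
As an illustration of the string format described above, the following
Python sketch splits one entry into its name, recognition strings and
cut offsets, and turns a recognition string containing IUB symbols into
a regular expression. It is not the routine nip uses: parse_entry and
to_regex are hypothetical names, and a string with no ' mark is given
offset 0 on the assumption that this mirrors the listings, which show
such strings with the mark in front (for example 'TTTTTT).
.lit
```python
import re

# NC-IUB symbols and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}


def parse_entry(entry):
    """Split 'NAME/STRING/STRING//' into (name, [(string, cut_offset)]).

    cut_offset is the number of symbols to the left of the ' mark.
    ASSUMPTION: a string without a ' mark gets offset 0.
    """
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("entry must end with //: " + entry)
    name, *strings = entry[:-2].split("/")
    if not name or not strings or "" in strings:
        raise ValueError("malformed entry: " + entry)
    sites = []
    for string in strings:
        cut = string.find("'")
        if cut < 0:
            cut = 0
        sites.append((string.replace("'", ""), cut))
    return name, sites


def to_regex(recognition):
    """Compile a recognition string into a regex, one class per symbol."""
    classes = ["[" + IUB[symbol] + "]" for symbol in recognition.upper()]
    return re.compile("".join(classes))


name, sites = parse_entry("ACCI/GT'MKAC//")
print(name, sites)
pattern = to_regex(sites[0][0])
print(bool(pattern.fullmatch("GTCGAC")), bool(pattern.fullmatch("GTAGAC")))
```
.end lit
The two sequences tested at the end are the ACCI matches that appear in
the example listings (GT'CGAC and GT'AGAC).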
The user can continue to select names until a blank name is entered (by the user typing only return). .para Listed output can be displayed in several ways: it can be ordered enzyme by enzyme, or on cut positions, or with enzyme names written above a listing of the sequence. This last listing can also include a three phase translation of the sequence. In addition the program will display only infrequent cutters (the user defines the minimum number of cuts), or can plot the positions of matches. .para Listings sorted "enzyme by enzyme" have the following form: .lit Matches found= 1 Name Sequence Position Fragment lengths 1 AATII GACGT'C 112 111 111 912 912 Matches found= 2 Name Sequence Position Fragment lengths 1 ACCI GT'CGAC 112 111 111 2 ACCI GT'AGAC 420 308 308 604 604 Matches found= 2 Name Sequence Position Fragment lengths 1 AHAII GA'CGTC 109 108 90 2 AHAII GG'CGTC 199 90 108 825 825 Matches found= 2 Name Sequence Position Fragment lengths 1 AVAII G'GACC 84 83 51 2 AVAII G'GTCC 973 889 83 51 889 Matches found= 1 Name Sequence Position Fragment lengths 1 BALI TGG'CCA 258 257 257 766 766 Matches found= 1 Name Sequence Position Fragment lengths 1 BAMHI G'GATCC 92 91 91 ...... 
etc Listings sorted on cut position have the following form: Searching Name Sequence Position Fragment lengths 1 ECORI G'AATTC 2 1 2 BANI G'GTGCC 26 24 3 BSP1286 GTGCC'C 31 5 4 BBVI 'TACTGCGCCGCAGCTGC 38 7 5 NSPBII CAG'CTG 51 13 6 PVUII CAG'CTG 51 0 7 BBVI GCAGCTGCTGGTG' 60 9 8 HINCII GTC'AAC 80 20 9 AVAII G'GACC 84 4 10 BINI 'CCAGGGATCC 87 3 11 BSTNI CC'AGG 89 2 12 BAMHI G'GATCC 92 3 13 XHOII G'GATCC 92 0 14 NSPBII CCG'CTG 98 6 15 BINI GGATCCGCT' 100 2 16 AHAII GA'CGTC 109 9 17 SALI G'TCGAC 111 2 18 AATII GACGT'C 112 1 19 ACCI GT'CGAC 112 0 20 HINCII GTC'GAC 113 1 21 BBVI GCAGCGACTGATT' 166 53 22 BINI 'ACTCAGATCC 178 12 23 XHOII A'GATCC 183 5 24 HGAI 'GGCGGCGGAGGCGTC 188 5 .....etc Lists of infrequent cutters have the following form: 0 AFLII 0 AFLIII 0 APAI 0 APALI 0 ASUII 0 AVAI 0 AVRII 0 BCLI 0 BGLI 0 BGLII 0 BSMI 0 BSPMII 0 BSTEII ...... etc Listings showing names above the sequence, and a translation have the following form: ECORI BANI BSP1286 . . . BBVI NSPBII . . . . PVUII BBVI GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTG GTGC 10 20 30 40 50 60 E F G L G L V * G A Q R L L R R S C W C N S V W A W C E V P R D Y C A A A A G A I R F G L G V R C P E I T A P Q L L V L HINCII . AVAII . . BINI . . . BSTNI . . . . BAMHI . . . . XHOII NSPBII . . . . . . BINI AHAII . . . . . . . . SALI . . . . . . . . .AATII . . . . . . . . .ACCI . . . . . . . . ..HINCII TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACG ACGA 70 80 90 100 110 120 W R C G G R Q R T Q G S A G R G R R R R G G A E V V N G P R D P L D E D V D D E A V R R S S T D P G I R W T R T S T T R BBVI BINI GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGAC CACT 130 140 150 160 170 180 G G G G * R I A G G W Q R L I * V L T T E E V D S A L L V A G S D * F E F * P L R R W I A H C W W L A A T D L S S D H S XHOII . HGAI AHAII PFIMI . . . . 
BBVI CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTA GATT 190 200 210 220 230 240 Q I R R R R R R G S R N S Q W L A C * I R S G G G G V E A P E T P S G W P A R F D P A A E A S R L P K L P V A G L L D S .........etc
.end lit
.para
The terms "possible" and "definite" matches are important only for back
translations of protein into DNA, which include IUB redundancy codes.
Matches that the program terms "definite matches" are ones in which the
specification of the recognition sequence corresponds exactly to that of
the back translation, and which consequently are definitely in the DNA
sequence. The program will also find what it terms "possible matches",
which are ones that depend on the particular codons chosen for each
amino acid. These are sites at which recognition sequences could be
engineered to produce a cut in the DNA without changing the amino acid,
but which are not necessarily found in the original sequence.
.para
The routine will handle both linear and circular sequences, and so finds
cutsites spanning the "ends" of circular sequences. The program will
only find cutsites spanning the ends of sequences if the sequence is
declared as circular. This includes sites for recognition sequences
containing leading or trailing N symbols, in which the actual
recognition sequence does not span the join. For example if the
recognition sequence was 'NNNNACGT and the first 4 characters in the
sequence were ACGT, then the match would only be found if the sequence
was declared as circular. If the sequence is linear then the first
fragment starts at base number 1, and the last ends at the last base. If
the sequence is circular then the length of the first fragment is the
clockwise distance from the last cut to the first.
.para
Graphical output marks the position of each string by a short vertical
line and gives the name of the enzyme at the left end of the line.
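.para
The rule for fragment lengths on linear and circular sequences can be
sketched in Python as follows. This is an illustration only:
fragment_lengths is a hypothetical name, and the sketch assumes that a
listed position is the first base of the fragment on the 3' side of the
cut, which agrees with the example listings (AATII at position 112 gives
fragments of 111 and 912). The sequence length of 1023 is inferred from
those two fragment lengths and is not stated in the text.
.lit
```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Return fragment lengths in order along the sequence.

    ASSUMPTION: each position is the first base after the cut, so a
    cut at position p ends the preceding fragment at base p - 1.
    """
    cuts = sorted(cut_positions)
    if not cuts:
        # An uncut sequence is a single fragment of full length.
        return [seq_length]
    # Distances between neighbouring cuts.
    inner = [right - left for left, right in zip(cuts, cuts[1:])]
    if circular:
        # First fragment: clockwise distance from the last cut,
        # through the join, to the first cut.
        return [seq_length - cuts[-1] + cuts[0]] + inner
    # Linear: first fragment starts at base 1, last ends at the last base.
    return [cuts[0] - 1] + inner + [seq_length - cuts[-1] + 1]


print(fragment_lengths([112, 420], 1023))
print(fragment_lengths([112, 420], 1023, circular=True))
```
.end lit
In both cases the fragment lengths sum to the sequence length, which is
a convenient check on any cut list.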
If the top of the screen is reached the program gives the user the
opportunity to take a hard copy, and will then clear the screen and
restart plotting results at the original start position.
.para
Below is an edited piece of dialogue from use of the search option:
.lit
 ? Menu or option number=17
 Search for restriction enzyme sites
 X 1 Search
   2 List enzyme file
   3 Clear text
   4 Clear graphics
 ? 0,1,2,3,4 = 2
   1 All enzymes
 X 2 Six cutters
   3 Four cutters
   4 Personal file
   5 Keyboard
 ? 0,1,2,3,4,5 =
 AATII/GACGT'C//
 ACCI/GT'MKAC//
 AFLII/C'TTAAG//
 AFLIII/A'CRYGT//
 AHAII/GR'CGYC//
 APAI/GGGCC'C//
 APALI/G'TGCAC//
 ASUII/TT'CGAA//
 AVAI/C'YCGRG//
 AVAII/G'GWCC//
 AVRII/C'CTAGG//
 BALI/TGG'CCA//
 BAMHI/G'GATCC//
 BANI/G'GYRCC//
 BANII/GRGCY'C//
 BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
 BCLI/T'GATCA//
 BGLI/GCCNNNN'NGGC//
 BGLII/A'GATCT//
 BINI/GGATCNNNN'/'NNNNNGATCC//
 BSMI/GAATGCN'/NG'CATTC//
 BSP1286/GDGCH'C//
 X 1 Search
   2 List enzyme file
   3 Clear text
   4 Clear graphics
 ? 0,1,2,3,4 =
   1 All enzymes
 X 2 Six cutters
   3 Four cutters
   4 Personal file
   5 Keyboard
 ? 0,1,2,3,4,5 =
 ? (y/n) (y) Search for all names
 X 1 Order results enzyme by enzyme
   2 Order results by position
   3 Show only infrequent cutters
   4 Show names above the sequence
 ? 0,1,2,3,4 =
 ? (y/n) (y) List matches
 ? (y/n) (y) The sequence is linear
 ?
(y/n) (y) Search for definite matches Searching Matches found= 1 Name Sequence Position Fragment lengths 1 AATII GACGT'C 112 111 111 912 912 Matches found= 2 Name Sequence Position Fragment lengths 1 ACCI GT'CGAC 112 111 111 2 ACCI GT'AGAC 420 308 308 604 604 Matches found= 2 Name Sequence Position Fragment lengths 1 AHAII GA'CGTC 109 108 90 2 AHAII GG'CGTC 199 90 108 825 825 Matches found= 2 Name Sequence Position Fragment lengths 1 AVAII G'GACC 84 83 51 2 AVAII G'GTCC 973 889 83 51 889 Matches found= 1 Name Sequence Position Fragment lengths 1 BALI TGG'CCA 258 257 257 766 766 Matches found= 1 Name Sequence Position Fragment lengths 1 BAMHI G'GATCC 92 91 91 932 932 Matches found= 1 Name Sequence Position Fragment lengths 1 BANI G'GTGCC 26 25 25 998 998 Matches found= 1 Name Sequence Position Fragment lengths 1 BANII GAGCC'C 490 489 489 534 534 Matches found= 11 Name Sequence Position Fragment lengths 1 BBVI 'TACTGCGCCGCAGCTGC 38 37 3 2 BBVI GCAGCTGCTGGTG' 60 22 22 3 BBVI GCAGCGACTGATT' 166 106 28 4 BBVI 'CCTGCTAGATTCGCTGC 230 64 37 5 BBVI GCAGCGGTACGTA' 452 222 50 6 BBVI 'CTCGCCAACGTTGCTGC 502 50 55 7 BBVI GCAGCCTTCAACT' 606 104 64 8 BBVI 'GAGGTATTCCTGGCTGC 634 28 97 9 BBVI 'CTGGCCGCCGCCGCTGC 869 235 104 10 BBVI 'GCCGCCGCCGCTGCTGC 872 3 106 11 BBVI GCAGCGATGAGGA' 927 55 222 ....etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All enzymes X 2 Six cutters 3 Four cutters 4 Personal file 5 Keyboard ? 0,1,2,3,4,5 = ? (y/n) (y) Search for all names X 1 Order results enzyme by enzyme 2 Order results by position 3 Show only infrequent cutters 4 Show names above the sequence ? 0,1,2,3,4 = 2 ? (y/n) (y) List matches ? (y/n) (y) The sequence is linear ? 
(y/n) (y) Search for definite matches Searching Name Sequence Position Fragment lengths 1 ECORI G'AATTC 2 1 2 BANI G'GTGCC 26 24 3 BSP1286 GTGCC'C 31 5 4 BBVI 'TACTGCGCCGCAGCTGC 38 7 5 NSPBII CAG'CTG 51 13 6 PVUII CAG'CTG 51 0 7 BBVI GCAGCTGCTGGTG' 60 9 8 HINCII GTC'AAC 80 20 9 AVAII G'GACC 84 4 10 BINI 'CCAGGGATCC 87 3 11 BSTNI CC'AGG 89 2 12 BAMHI G'GATCC 92 3 13 XHOII G'GATCC 92 0 14 NSPBII CCG'CTG 98 6 15 BINI GGATCCGCT' 100 2 16 AHAII GA'CGTC 109 9 17 SALI G'TCGAC 111 2 18 AATII GACGT'C 112 1 19 ACCI GT'CGAC 112 0 20 HINCII GTC'GAC 113 1 .....etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All enzymes X 2 Six cutters 3 Four cutters 4 Personal file 5 Keyboard ? 0,1,2,3,4,5 = ? (y/n) (y) Search for all names 1 Order results enzyme by enzyme X 2 Order results by position 3 Show only infrequent cutters 4 Show names above the sequence ? 0,1,2,3,4 =3 ? Maximum number of cuts (0-100) (0) = ? (y/n) (y) The sequence is linear ? (y/n) (y) Search for definite matches Searching 0 AFLII 0 AFLIII 0 APAI 0 APALI 0 ASUII 0 AVAI 0 AVRII 0 BCLI 0 BGLI 0 BGLII 0 BSMI 0 BSPMII 0 BSTEII 0 CLAI 0 DRAI 0 DRAII 0 ECOB 0 ECOK 0 ECORV 0 ESPI ......etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All enzymes X 2 Six cutters 3 Four cutters 4 Personal file 5 Keyboard ? 0,1,2,3,4,5 = ? (y/n) (y) Search for all names 1 Order results enzyme by enzyme 2 Order results by position X 3 Show only infrequent cutters 4 Show names above the sequence ? 0,1,2,3,4 =4 ? (y/n) (y) Hide translation n ? (y/n) (y) Use 1 letter codes ? Line length (30-90) (60) = ? (y/n) (y) The sequence is linear ? (y/n) (y) Search for definite matches Searching ECORI BANI BSP1286 . . . BBVI NSPBII . . . . PVUII BBVI GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTG GTGC 10 20 30 40 50 60 E F G L G L V * G A Q R L L R R S C W C N S V W A W C E V P R D Y C A A A A G A I R F G L G V R C P E I T A P Q L L V L HINCII . AVAII . . BINI . . . BSTNI . . . . 
BAMHI . . . . XHOII NSPBII . . . . . . BINI AHAII . . . . . . . . SALI . . . . . . . . .AATII . . . . . . . . .ACCI . . . . . . . . ..HINCII TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACG ACGA 70 80 90 100 110 120 W R C G G R Q R T Q G S A G R G R R R R G G A E V V N G P R D P L D E D V D D E A V R R S S T D P G I R W T R T S T T R BBVI BINI GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGAC CACT 130 140 150 160 170 180 G G G G * R I A G G W Q R L I * V L T T E E V D S A L L V A G S D * F E F * P L R R W I A H C W W L A A T D L S S D H S .......etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All enzymes X 2 Six cutters 3 Four cutters 4 Personal file 5 Keyboard ? 0,1,2,3,4,5 =5 Define search strings by typing a string name followed by the string(s) ? Name=FRED ? String(s)=AAAAAA/TTTTTT ? Name=MARY ? String(s)=CCCC/GGGG/GCGCT ? Name= ? (y/n) (y) Search for all names X 1 Order results enzyme by enzyme 2 Order results by position 3 Show only infrequent cutters 4 Show names above the sequence ? 0,1,2,3,4 = ? (y/n) (y) List matches ? (y/n) (y) The sequence is linear ? (y/n) (y) Search for definite matches Searching Matches found= 9 Name Sequence Position Fragment lengths 1 FRED 'TTTTTT 1557 1556 1 2 FRED 'TTTTTT 1558 1 1 3 FRED 'TTTTTT 1559 1 1 4 FRED 'TTTTTT 1560 1 22 5 FRED 'AAAAAA 1582 22 529 6 FRED 'AAAAAA 3160 1578 1019 7 FRED 'AAAAAA 4204 1044 1044 8 FRED 'AAAAAA 5691 1487 1487 9 FRED 'AAAAAA 6710 1019 1556 529 1578 Matches found= 36 Name Sequence Position Fragment lengths 1 MARY 'CCCC 47 46 1 2 MARY 'GGGG 486 439 1 3 MARY 'GGGG 487 1 1 4 MARY 'CCCC 557 70 1 5 MARY 'CCCC 558 1 1 6 MARY 'GCGCT 1177 619 1 ... etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 = 1 All enzymes X 2 Six cutters 3 Four cutters 4 Personal file 5 Keyboard ? 0,1,2,3,4,5 =5 Define search strings by typing a string name followed by the string(s) ? Name=JANE ? String(s)=A'TTTT/CC'GGG ? Name= ? 
(y/n) (y) Search for all names X 1 Order results enzyme by enzyme 2 Order results by position 3 Show only infrequent cutters 4 Show names above the sequence ? 0,1,2,3,4 = ? (y/n) (y) List matches ? (y/n) (y) The sequence is linear ? (y/n) (y) Search for definite matches Searching Matches found= 30 Name Sequence Position Fragment lengths 1 JANE A'TTTT 437 436 6 2 JANE A'TTTT 546 109 33 3 JANE A'TTTT 597 51 43 4 JANE A'TTTT 777 180 51 5 JANE A'TTTT 1274 497 60 6 JANE A'TTTT 1571 297 62 7 JANE CC'GGG 1926 355 75 8 JANE A'TTTT 2403 477 81 9 JANE A'TTTT 2586 183 82 10 JANE A'TTTT 2731 145 101 11 JANE A'TTTT 2812 81 103 ... etc X 1 Search 2 List enzyme file 3 Clear text 4 Clear graphics ? 0,1,2,3,4 =! .end lit .left margin1 @18. TX 1 7 @ Compare a short sequence .LEFT MARGIN2 .para This routine slides a short sequence along the current sequence and finds all positions at which a given percentage of the bases match. Output is in both graphical and listed forms. .para If users call for dialogue when the routine is selected they will be given the choice of keyboard or file input. Define the string, select the "sense" to use and the percentage match. Matches will be plotted out and then the user can select to have them listed. Then the routine cycles around. .para The routine slides the search string along the sequence and marks the positions at which a minimum percentage score is reached. The graphical output draws a vertical line at the match position; the height of the line represents the percentage score, so that if the line reaches the top of the box the score is 100%. The NC-IUB symbols may be used in the search sequence to encode uncertain characters. Any other symbols will not match. .LIT NC-IUB SYMBOLS A,C,G,T R (A,G) 'puRine' Y (T,C) 'pYrimidine' W (A,T) 'Weak' S (C,G) 'Strong' M (A,C) 'aMino' K (G,T) 'Keto' H (A,T,C) 'not G' B (G,C,T) 'not A' V (G,A,C) 'not T' D (G,A,T) 'not C' N (G,A,C,T) 'aNy' Typical dialogue is shown below. ? 
Menu or option number=18 Find percentage matches ? (y/n) (y) Keep picture ? String=AAATTTCCC STRING=AAATTTCCC ? (y/n) (y) This sense ? Percent match (1.00-100.00) (70.00) = Missing graphics display here Total scoring positions above 70.000 percent = 7 Scores 7 6 6 6 6 6 6 Positions 365 212 213 292 311 358 627 ? Display (0-7) (0) =3 365 ACATTTCGC * ***** * AAATTTCCC 1 212 GAAACTCCC ** **** AAATTTCCC 1 213 AAACTCCCA *** * ** AAATTTCCC 1 ? (y/n) (y) Keep picture Default String=AAATTTCCC ? String= STRING=AAATTTCCC ? (y/n) (y) This sense n STRING=GGGAAATTT ? Percent match (1.00-100.00) (70.00) = Missing graphics display here Total scoring positions above 70.000 percent = 7 Scores 6 6 6 6 6 6 6 Positions 269 270 271 288 354 624 853 ? Display (0-7) (0) =3 269 GAGGGATTT * * **** GGGAAATTT 1 270 AGGGATTTT ** * *** GGGAAATTT 1 271 GGGATTTTC **** ** GGGAAATTT 1 ? (y/n) (y) Keep picture ! .end lit .left margin1 @19. TX 7 @ Compare a short sequence using a score matrix .LEFT MARGIN2 .para This routine slides a short sequence along the current sequence and finds all positions at which a given level of similarity (a cutoff score) is reached. The score is defined by use of a score matrix. Output is in both graphical and listed forms. .para If users call for dialogue when the routine is selected they will be given the choice of keyboard or file input. Define the string, select the "sense" to use and the cutoff score. Matches will be plotted out and then the user can select to have them listed. Then the routine cycles around. .para The routine slides the search string along the sequence and marks the positions at which the cutoff score is achieved. The graphical output draws a vertical line at the match position; the height of the line represents the score, so that if the line reaches the top of the box the score is the maximum possible. The NC-IUB symbols may be used in the search sequence to encode uncertain characters.
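.para The sliding search can be sketched in a few lines of Python. This is an illustration only, not the program's own code: the names IUB, symbol_score and find_matches are hypothetical, and the scoring is a simplification that assumes the sequence contains only the definite bases A, C, G and T. Each probe symbol scores 36 divided by the number of bases it stands for when the sequence base is one of them, and 0 otherwise, which agrees with the T, C, A and G columns of the score matrix given for this option.

```python
# Sketch only: bases represented by each NC-IUB symbol (hypothetical helper table).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}


def symbol_score(probe_symbol, base):
    """Score one probe symbol against one definite sequence base.

    Assumption: 36 / (number of bases the symbol stands for) when the base
    is allowed, 0 otherwise. Unrecognised probe symbols score 0.
    """
    allowed = IUB.get(probe_symbol.upper())
    if allowed is None or base.upper() not in allowed:
        return 0
    return 36 // len(allowed)


def find_matches(sequence, probe, cutoff):
    """Slide the probe along the sequence; return (position, score) pairs.

    Positions are 1-based start positions, as in the listed output.
    """
    hits = []
    for start in range(len(sequence) - len(probe) + 1):
        window = sequence[start:start + len(probe)]
        score = sum(symbol_score(p, b) for p, b in zip(probe, window))
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits


if __name__ == "__main__":
    # A 9-base probe has a maximum score of 9 * 36 = 324.
    print(find_matches("GGACATTTCGCGG", "AAATTTCCC", 250))
```

With this scoring the window ACATTTCGC matches the probe AAATTTCCC at seven of nine positions and scores 7 x 36 = 252, the value listed for position 365 in the typical dialogue for this option.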
.para The score matrix reflects the level of redundancy in the probe sequence and hence will put more emphasis on those characters that are better defined. The score matrix is: .lit DNA SCORE MATRIX USING IUB SYMBOLS T C A G - R Y W S M K H B V D N ? T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0 C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0 A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0 G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0 - 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0 R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0 Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0 W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0 S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0 M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0 K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0 H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0 B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0 V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0 D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0 N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0 ? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ? is any unrecognised character. Typical dialogue is shown below. ? Menu or option number=19 Find matches using a score matrix ? (y/n) (y) Keep picture ? String=AAATTTCCC STRING=AAATTTCCC ? (y/n) (y) This sense Minimum score= 0 Maximum score= 324 ? Score (0-324) (280) =250 Missing graphics display here For score 250 the number of matches= 1 Scores 252 Positions 365 ? Display (0-1) (0) =1 365 ACATTTCGC * ***** * AAATTTCCC 1 ? (y/n) (y) Keep picture Default String=AAATTTCCC ? String= STRING=AAATTTCCC ? (y/n) (y) This sense n STRING=GGGAAATTT Minimum score= 0 Maximum score= 324 ? Score (0-324) (222) = 200 Missing graphics display here For score 200 the number of matches= 7 Scores 216 216 216 216 216 216 216 Positions 269 270 271 288 354 624 853 ? Display (0-7) (0) =3 269 GAGGGATTT * * **** GGGAAATTT 1 270 AGGGATTTT ** * *** GGGAAATTT 1 271 GGGATTTTC **** ** GGGAAATTT 1 ? (y/n) (y) Keep picture ! .end lit .left margin1 @20. 
TX 7 @ Search for a motif using a weight matrix .LEFT MARGIN2 .para This function performs searches for short sequence motifs using an appropriate weight matrix. In addition it can be used to create or modify weight matrices. In order to perform a search the only input required is the name of the file containing the weight matrix. The results can be presented graphically or listed. The graphical presentation will draw a line at the position of any matches found; the height of the line is proportional to the score. .para For a search, select "use weight matrix", supply the name of the file containing the weight matrix, and choose between having results plotted or listed. If dialogue is requested when the function is selected users can alter the cutoff score employed. .para To create a weight matrix several steps are involved. A file containing an alignment of known motifs is required. (This file must be created before the current option is selected. The format is as follows: each sequence is written on a separate line with at least one space at the beginning; each sequence is terminated by a space character, and can be followed by a name. The sequences must be aligned.) Supply the name of the file of aligned sequences. The program reads and displays the sequences. Choose between "summing logs of weights" or summing weights (i.e. whether to multiply or add weights). If logs are used all scores will be negative. Choose whether all positions in the set of aligned sequences should be used or whether a mask should be applied. If a mask is selected, define it as a string of symbols, in which the symbol - means ignore and any other symbol means use. E.g. xx-x--abc means use all positions except 3, 5 and 6. .para The program will calculate weights as the frequencies of each base at each unmasked position in the set of aligned sequences. These weights are then applied to the set of aligned sequences to give a range of "observed" scores.
The mean and standard deviation of these scores are displayed. The user is asked to supply several values to be used when the weight matrix is applied to other sequences: a cutoff score (by default, the mean minus 3 standard deviations), a top score for scaling graphical results (by default, the mean plus 3 standard deviations), and a position to identify (this means that if a particular base within the motif is used as a "landmark", such as the A of the AG in splice acceptor sites, then its position will be marked in plots). All these values are stored along with the weight matrix. Finally supply the name of a file to contain the weight matrix. .para Weight matrices can be "rescaled" using a set of aligned sequences in much the same way as a matrix is created. The purpose is to redefine the cutoff scores, and rescaling does not alter any other values in the weight matrix file. .para The methods have changed considerably but were first outlined in Staden, R. Nucl. Acids Res. 12 505-519 1984, and Staden, R. Genetic engineering: principles and methods vol 7, Edited by J.K. Setlow and A. Hollaender, Plenum publishing corp., 1985. .para The methods have always had to deal with the problem of zeroes in the matrices. The current versions employ "Laplace's Law of Succession" in which 1 is added to each term. .para It is now possible to apply a mask to a set of aligned sequences in order to give weight to selected positions only. Sequences have superimposed functions: some parts may be of general structural importance and give rise to an overall framework, and other parts give specificity and hence are not common; we may want to use a set of aligned sequences to define a motif, but want to use only the framework positions. Alternatively we may want to pick out only those parts of a set of aligned sequences that give a particular property, and to ignore other similarities that are due to some other property and which could obscure the pattern we are interested in.
The ability to define a mask allows certain positions to be used in the motif and others to be ignored, and yet still permits the use of a set of aligned sequences to calculate weights. .para Typical dialogue is shown below. .lit ? Menu or option number=20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 =2 ? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ 1 AGCGTGACTCTTCCCGGAA HIS1 2 GAGGTGACTCACTTGGAAG HIS1 3 CGGATGACTCTTTTTTTTT HIS3 4 ACAGTGACTCACGTTTTTT HIS4 5 GTCGTGACTCATATGCTTT ARG3 6 TGAATGACTCACTTTTTGG ARG4 7 TTCTTGACTCGTCTTTTCT CPA1 8 CGAATGACTCTTATTGATG CPA2 9 AGAATGACTAATTTTACTA TRP5 10 TCGTTGACTCATTCTAATC TRP3 11 TTGCTGACTCATTACGATT TRP2 12 GAGATGACTCTTTTTCTTT IV1 13 GCGATGATTCATTTCTCTG IV2 14 TAGATGACTCAGTTTAGTC LEU1 15 TAAGTGACTCAGTTCTTTC LEU4 16 ATGATGACTCTTAAGCATG ILS1 Length of motif 19 ? (y/n) (y) Sum logs of weights ? (y/n) (y) Use all motif positions n x means use, - means ignore e.g. xx-x---x-x means use positions 1,2,4,8,10 ? Mask=----XXXXXXXX Applying weights to input sequences 1 -27.979 AGCGTGACTCTTCCCGGAA 2 -24.543 GAGGTGACTCACTTGGAAG 3 -20.890 CGGATGACTCTTTTTTTTT 4 -23.087 ACAGTGACTCACGTTTTTT 5 -22.771 GTCGTGACTCATATGCTTT 6 -23.408 TGAATGACTCACTTTTTGG 7 -25.159 TTCTTGACTCGTCTTTTCT 8 -22.679 CGAATGACTCTTATTGATG 9 -24.751 AGAATGACTAATTTTACTA 10 -23.157 TCGTTGACTCATTCTAATC 11 -23.067 TTGCTGACTCATTACGATT 12 -21.449 GAGATGACTCTTTTTCTTT 13 -24.191 GCGATGATTCATTTCTCTG 14 -23.770 TAGATGACTCAGTTTAGTC 15 -22.923 TAAGTGACTCAGTTCTTTC 16 -25.285 ATGATGACTCTTAAGCATG Top score -20.890 Bottom score -27.979 Mean -23.694 Standard deviation 1.613 Mean minus 3.sd -28.534 Mean plus 3.sd -18.854 ? Cutoff score (-999.00-9999.00) (-28.53) = ? Top score for scaling plots (-28.53-999.00) (-18.85) = ? Position to identify (0-19) (1) = ? Title=GCN4 SEQUENCES ? Name for new weight matrix file=1.WTS ? Menu or option number=20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 =3 ? 
Name of existing weight matrix file=1.WTS GCN4 SEQUENCES ? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ Length of motif 19 ? (y/n) (y) Sum logs of weights n ? (y/n) (y) Use all motif positions Applying weights to input sequences 1 128.000 AGCGTGACTCTTCCCGGAA 2 148.000 GAGGTGACTCACTTGGAAG 3 172.000 CGGATGACTCTTTTTTTTT 4 160.000 ACAGTGACTCACGTTTTTT 5 161.000 GTCGTGACTCATATGCTTT 6 157.000 TGAATGACTCACTTTTTGG 7 149.000 TTCTTGACTCGTCTTTTCT 8 160.000 CGAATGACTCTTATTGATG 9 151.000 AGAATGACTAATTTTACTA 10 159.000 TCGTTGACTCATTCTAATC 11 158.000 TTGCTGACTCATTACGATT 12 169.000 GAGATGACTCTTTTTCTTT 13 152.000 GCGATGATTCATTTCTCTG 14 157.000 TAGATGACTCAGTTTAGTC 15 160.000 TAAGTGACTCAGTTCTTTC 16 143.000 ATGATGACTCTTAAGCATG Top score 172.000 Bottom score 128.000 Mean 155.250 Standard deviation 10.034 Mean minus 3.sd 125.147 Mean plus 3.sd 185.353 ? Cutoff score (-999.00-9999.00) (125.15) = ? Top score for scaling plots (125.15-999.00) (185.35) = ? Position to identify (0-19) (1) = ? Title=GCN4 SEQUENCES ? Name for new weight matrix file=2.WTS ? Menu or option number=20 X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 = ? Motif weight matrix file=1.WTS GCN4 SEQUENCES ? 
(y/n) (y) Plot results n 153 -22.61 GCAGCGACTGATTTGAGTT 169 -28.53 GTTCTGACCACTCAGATCC 172 -27.27 CTGACCACTCAGATCCGGC 219 -27.35 CCAGTGGCTGGCCTGCTAG 268 -27.82 CGAGGGATTTTCGATCTTG 274 -26.99 ATTTTCGATCTTGTGGATG 283 -25.79 CTTGTGGATGATTTTCACG 287 -27.50 TGGATGATTTTCACGTGCG 298 -28.17 CACGTGCGCCGTCATATTG 332 -28.27 TCTTTGAAGCAGAAGGGAC 351 -28.27 AGGGGTACACTTTCACATT 357 -25.05 ACACTTTCACATTTCGCTT 364 -28.51 CACATTTCGCTTATGGGAG 400 -23.77 GAAGTTACTAATGTGCGTG 451 -26.22 ATGCTCGCCCTCTTTGGTG 476 -28.00 TCCCTCACTGAGCCCTCCG 480 -28.33 TCACTGAGCCCTCCGCCTC 517 -23.46 GCTAAGATTCAGCTTGGTT 556 -27.27 TCCAGCACTCAGGTTCGGC 602 -27.01 AACTTGAATCCATCGTTGC 648 -28.45 TGCTAAACACAGCCGGTTT 679 -28.18 CTGTTTGCCCAGTTTGGGC 691 -28.51 TTTGGGCCGCTTCTGGACG 713 -27.67 GGCTTGACCGTGGCTGTGG 803 -25.47 ATGCTGACCATGCTTTTCA 848 -28.11 ATAATGTTAAGTTTGATTC 857 -25.97 AGTTTGATTCCGCTGGCCG 879 -27.85 CCGCTGCTGCTGTTTCCAC 917 -27.77 GCGATGAGGAAGGCTTGTT 931 -27.81 TTGTTGGCGCGCCTGCTCG 952 -23.52 GAGGTGACTACCATCCGTG 977 -28.40 TGCGTGGGTGAGCTGTTGT ? Menu or option number=6 Page through text files ? Name of file to read=1.WTS GCN4 SEQUENCES 19 1 -28.534 -18.854 P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 N 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 T 0 0 0 0 16 0 0 1 16 0 5 11 10 12 9 6 7 12 6 C 0 0 0 0 0 0 0 15 0 15 0 3 2 2 4 3 2 1 3 A 0 0 0 0 0 0 16 0 0 1 10 0 3 2 0 3 5 2 2 G 0 0 0 0 0 16 0 0 0 0 1 2 1 0 3 4 2 1 5 End of file .end lit .left margin1 @21. TX 3 @ Count base composition .LEFT MARGIN2 .para This routine calculates the base composition of the active region of the sequence as both totals and percentages. .left margin1 @22. TX 3 @ Count dinucleotide frequencies .LEFT MARGIN2 .para This routine simply counts dinucleotide frequencies for the currently active region of the sequence. It also calculates an expected distribution based on the base composition. 
The output looks like: .LIT T C A G obs expected obs expected obs expected obs expected T 8.44 8.25 6.67 7.01 10.35 9.92 3.27 3.54 C 7.49 7.01 6.76 5.95 8.39 8.43 1.76 3.01 A 10.13 9.92 7.78 8.43 11.74 11.93 4.89 4.26 G 2.67 3.54 3.19 3.01 4.06 4.26 2.42 1.52 .END LIT .left margin1 @23. TX 3 5 @ Count codons and amino acids .LEFT MARGIN2 .para This function counts codons and calculates amino acid composition, protein molecular weights, and base composition. Users select the segments of the sequence that the program should analyse. .para Choose between being shown observed counts or counts normalised so that the totals for each amino acid sum to 100. Select to define segments using either the keyboard or an EMBL feature table. Define the segments to count over. Select strand for each segment. Stop selecting segments by typing a zero for "Count from ()". The results are displayed a screenful at a time, and the bell is sounded to show there is more to come. A zero start position, or the end of an EMBL feature table, signals the routine to print out totals for all values. .para The counts are broken down into several figures: base composition by position in codon expressed as a percentage of each base's own frequency; base composition by position in codon expressed as a percentage of the overall base composition of the section; base composition expected for this amino acid composition if there were no codon preference; percentage deviations of the observed amino acid composition from an average amino acid composition. .para The output looks like: .LIT =========================================== F TTT 1. S TCT 2. Y TAT 2. C TGT 1. F TTC 1. S TCC 1. Y TAC 3. C TGC 2. L TTA 7. S TCA 4. * TAA 9. * TGA 1. L TTG 2. S TCG 1. * TAG 2. W TGG 2. =========================================== L CTT 3. P CCT 2. H CAT 4. R CGT 1. L CTC 2. P CCC 3. H CAC 1. R CGC 0. L CTA 3. P CCA 2. Q CAA 4. R CGA 0. L CTG 2. P CCG 2. Q CAG 1. R CGG 2. =========================================== I ATT 9. T ACT 1.
N AAT 7. S AGT 3. I ATC 2. T ACC 2. N AAC 4. S AGC 2. I ATA 4. T ACA 5. K AAA 13. R AGA 5. M ATG 1. T ACG 2. K AAG 4. R AGG 1. =========================================== V GTT 2. A GCT 2. D GAT 1. G GGT 3. V GTC 2. A GCC 2. D GAC 1. G GGC 1. V GTA 4. A GCA 3. E GAA 2. G GGA 1. V GTG 2. A GCG 0. E GAG 1. G GGG 1. =========================================== total codons= 166. T C A G 1 31.06 33.68 34.03 35.00 2 35.61 35.79 30.89 32.50 3 33.33 30.53 35.08 32.50 1 24.70 19.28 39.16 16.87 2 28.31 20.48 35.54 15.66 3 26.51 17.47 40.36 15.66 % 26.51 19.08 38.35 16.06 observed, overall totals % 25.00 22.26 33.10 19.65 expected, even codons per acid A C D E F G H I K L 7. 3. 2. 3. 2. 6. 5. 15. 17. 19. o-e % -47. -33. -76. -68. -64. -54. 62. 116. 67. 67. M N P Q R S T V W Y 1. 11. 9. 5. 9. 13. 10. 10. 2. 5. o-e % -62. 66. 12. -17. 19. 21. 6. -2. 0. -5. total acids= 154. molecular weight= 17421. Typical dialogue follows. ? Menu or option number=23 Calculate codon usage, base composition and amino acid composition ? (y/n) (y) Show observed counts ? (y/n) (y) Define segments using keyboard ? Count from (0-1023) (0) =1 ? Count to (1-1023) (1023) =1000 ? (y/n) (y) + strand =========================================== F TTT 13. S TCT 1. Y TAT 1. C TGT 3. F TTC 4. S TCC 10. Y TAC 1. C TGC 7. L TTA 1. S TCA 0. * TAA 1. * TGA 4. L TTG 4. S TCG 1. * TAG 3. W TGG 5. =========================================== L CTT 9. P CCT 1. H CAT 3. R CGT 14. L CTC 7. P CCC 0. H CAC 7. R CGC 14. L CTA 0. P CCA 0. Q CAA 4. R CGA 9. L CTG 12. P CCG 1. Q CAG 9. R CGG 8. =========================================== I ATT 7. T ACT 4. N AAT 4. S AGT 1. I ATC 4. T ACC 5. N AAC 3. S AGC 7. I ATA 1. T ACA 1. K AAA 3. R AGA 2. M ATG 2. T ACG 1. K AAG 2. R AGG 2. =========================================== V GTT 11. A GCT 13. D GAT 6. G GGT 9. V GTC 5. A GCC 10. D GAC 9. G GGC 11. V GTA 6. A GCA 5. E GAA 6. G GGA 12. V GTG 8. A GCG 5. E GAG 3. G GGG 8. 
=========================================== Total codons= 333. T C A G 1 23.32 37.69 28.99 40.06 2 37.15 22.31 38.46 36.59 3 39.53 40.00 32.54 23.34 ----- ----- ----- ----- = 100% 100% 100% 100% 1 17.72 29.43 14.71 38.14 = 100% 2 28.23 17.42 19.52 34.83 = 100% 3 30.03 31.23 16.52 22.22 = 100% % 25.33 26.03 16.92 31.73 Observed, overall totals % 24.44 22.31 20.90 32.35 Expected, even codons per acid A C D E F G H I K L 33. 10. 15. 9. 17. 40. 10. 12. 5. 33. O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13. M N P Q R S T V W Y 2. 7. 2. 13. 49. 20. 11. 30. 5. 2. O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81. Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8 ? Count from (0-1023) (0) = Codon totals over all genes =========================================== F TTT 13. S TCT 1. Y TAT 1. C TGT 3. F TTC 4. S TCC 10. Y TAC 1. C TGC 7. L TTA 1. S TCA 0. * TAA 1. * TGA 4. L TTG 4. S TCG 1. * TAG 3. W TGG 5. =========================================== L CTT 9. P CCT 1. H CAT 3. R CGT 14. L CTC 7. P CCC 0. H CAC 7. R CGC 14. L CTA 0. P CCA 0. Q CAA 4. R CGA 9. L CTG 12. P CCG 1. Q CAG 9. R CGG 8. =========================================== I ATT 7. T ACT 4. N AAT 4. S AGT 1. I ATC 4. T ACC 5. N AAC 3. S AGC 7. I ATA 1. T ACA 1. K AAA 3. R AGA 2. M ATG 2. T ACG 1. K AAG 2. R AGG 2. =========================================== V GTT 11. A GCT 13. D GAT 6. G GGT 9. V GTC 5. A GCC 10. D GAC 9. G GGC 11. V GTA 6. A GCA 5. E GAA 6. G GGA 12. V GTG 8. A GCG 5. E GAG 3. G GGG 8. =========================================== Total codons= 333. T C A G 1 23.32 37.69 28.99 40.06 2 37.15 22.31 38.46 36.59 3 39.53 40.00 32.54 23.34 ----- ----- ----- ----- = 100% 100% 100% 100% 1 17.72 29.43 14.71 38.14 = 100% 2 28.23 17.42 19.52 34.83 = 100% 3 30.03 31.23 16.52 22.22 = 100% % 25.33 26.03 16.92 31.73 Observed, overall totals % 24.44 22.31 20.90 32.35 Expected, even codons per acid A C D E F G H I K L 33. 10. 15. 9. 17. 40. 10. 12. 5. 33. O-E % 22. 81. -13. -55. 34. 71. 40. 
-29. -73. 13. M N P Q R S T V W Y 2. 7. 2. 13. 49. 20. 11. 30. 5. 2. O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81. Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8 .END LIT .LEFT MARGIN1 @24. TX 3 @ Plot base composition .LEFT MARGIN2 .para This option plots the base composition of the sequence. The counts for any combination of bases can be plotted. .para If dialogue is requested the user is presented with a check box for selecting which bases should be counted, and then allowed to define a window length, and a "plot interval". Otherwise, the AT composition is plotted with a window of 101 and a plot interval of 5. .para Typical dialogue follows. .lit ? Menu or option number=d24 Plot base composition checkbox: those set are marked X X 1 T 2 C X 3 A 4 G ? 0,1,2,3,4 =1 checkbox: those set are marked X 1 T 2 C X 3 A 4 G ? 0,1,2,3,4 =3 checkbox: those set are marked X 1 T 2 C 3 A 4 G ? 0,1,2,3,4 =2 checkbox: those set are marked X 1 T X 2 C 3 A 4 G ? 0,1,2,3,4 =4 checkbox: those set are marked X 1 T X 2 C 3 A X 4 G ? 0,1,2,3,4 = ? odd span length (1-201) (31) = ? plot interval (1-11) (5) = missing graphics .end lit .left margIN1 @25. TX 3 @ Plot local deviations in base composition .LEFT MARGIN2 .para The "local deviation" routines are designed to indicate the similarity of the compositions of different parts of the sequence. The composition of every segment of the sequence is compared with a standard composition. The levels of similarity are plotted as chi squared values. The standard can be the composition of the whole sequence, or alternatively that of a small segment defined by the user. .para If dialogue is forced, define the standard region, the window length and the plot interval. Otherwise the composition of the whole sequence is taken as a standard. The maximum and minimum observed values of the chi squared calculation are displayed, and plots will always exactly fill the available box. Any unusual regions will show as peaks.
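.para The windowed comparison against a standard composition can be sketched as follows. This is a hedged illustration rather than the program's code: the function names composition and local_deviation are hypothetical, compositions are taken as fractions of the four bases, and the measure is assumed to be the sum over the four bases of (obs-exp)*(obs-exp)/(exp*exp), which is one reading of the formula given for this option.

```python
from collections import Counter

BASES = "TCAG"


def composition(segment):
    """Fraction of each base in a segment; other symbols are ignored."""
    counts = Counter(b for b in segment.upper() if b in BASES)
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in BASES}


def local_deviation(sequence, window, interval, standard=None):
    """Return (1-based window start, deviation) pairs.

    The standard composition is that of `standard` when given, otherwise
    that of the whole sequence. Bases absent from the standard are skipped
    to avoid dividing by zero (an assumption of this sketch).
    """
    exp = composition(standard if standard is not None else sequence)
    values = []
    for start in range(0, len(sequence) - window + 1, interval):
        obs = composition(sequence[start:start + window])
        value = sum(
            (obs[b] - exp[b]) ** 2 / (exp[b] ** 2)
            for b in BASES
            if exp[b] > 0
        )
        values.append((start + 1, value))
    return values


if __name__ == "__main__":
    # The A-rich final window deviates most from the overall composition.
    for position, value in local_deviation("ACGTACGTAAAA", 4, 4):
        print(position, round(value, 3))
```

To fill the plotting box exactly, the values would then be rescaled between the minimum and maximum found on a first pass; that scaling step is omitted here.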
.para The following measure is used: for each window position calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the observed composition and exp is the expected composition (the composition of the standard). The calculation is performed once to find out the range of values and is then repeated and plotted so that the plot exactly fills the allocated screen space. .left margIN1 @26. TX 3 @ Plot local deviations from dinucleotide composition .LEFT MARGIN2 .para The "local deviation" routines are designed to indicate the similarity of the compositions of different parts of the sequence. The dinucleotide composition of every segment of the sequence is compared with a standard composition. The levels of similarity are plotted as chi squared values. The standard can be the composition of the whole sequence, or alternatively that of a small segment defined by the user. .para If dialogue is forced, define the standard region, the window length and the plot interval. Otherwise the composition of the whole sequence is taken as a standard. The maximum and minimum observed values of the chi squared calculation are displayed, and plots will always exactly fill the available box. Any unusual regions will show as peaks. .para The following measure is used: for each window position calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the observed composition and exp is the expected composition (the composition of the standard). The calculation is performed once to find out the range of values and is then repeated and plotted so that the plot exactly fills the allocated screen space. .left margin1 @27. TX 3 @ Plot local deviations from trinucleotide composition .LEFT MARGIN2 .para The "local deviation" routines are designed to indicate the similarity of the compositions of different parts of the sequence. The trinucleotide composition of every segment of the sequence is compared with a standard composition.
The levels of similarity are plotted as chi squared values. The standard can be the composition of the whole sequence, or alternatively that of a small segment defined by the user. .para If dialogue is forced, define the standard region, the window length and the plot interval. Otherwise the composition of the whole sequence is taken as a standard. The maximum and minimum observed values of the chi squared calculation are displayed, and plots will always exactly fill the available box. Any unusual regions will show as peaks. .para The following measure is used: for each window position calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the observed composition and exp is the expected composition (the composition of the standard). The calculation is performed once to find out the range of values and is then repeated and plotted so that the plot exactly fills the allocated screen space. .left margin1 @28. TX 5 @ Calculate codon constraint .left margin2 .para The purpose of this option (which is somewhat specialised) is to measure the level of constraint imposed on the sequence by coding for a protein of the observed composition. It measures the strength of the codon bias averaged over windows of 99 codons and displays the values observed. .para Select between defining segments at the keyboard or using an EMBL feature table. Finish selecting segments by typing a zero start. The value for each segment is displayed: .para Mean (W-EW) / EWD, window 99 10.5 .para The codon constraint is the difference between the observed codon improbability and the mean improbability for a sequence of the same composition. See McLachlan, Staden and Boswell Nucl. Acids Res. 1984 .left margin1 @29. TX 3 @ Plot negentropy .LEFT MARGIN2 .para This routine is designed to show regions of the sequence that differ in composition from others, and hence is like the "plot deviation.." routines.
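.para The average information per base used by this routine, I=-sum(Pi.Log(Pi)), can be sketched as follows. This is an illustration under stated assumptions, not the program's code: the names base_probabilities and window_information are hypothetical, natural logarithms are used because the base is not stated, and each window value is assumed to be the mean of -Log(Pi) over the bases in the window, with Pi taken from the overall composition. For a window covering the whole sequence this reduces to the formula above.

```python
import math
from collections import Counter


def base_probabilities(sequence):
    """Overall probability Pi of each base A, C, G, T in the sequence."""
    counts = Counter(b for b in sequence.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}


def window_information(sequence, window, interval=1):
    """Return (1-based window start, average information per base) pairs.

    Assumption of this sketch: the window value is the mean of -log(Pi)
    over the window's bases, with Pi from the overall composition.
    Symbols other than A, C, G, T are ignored.
    """
    p = base_probabilities(sequence)
    results = []
    for start in range(0, len(sequence) - window + 1, interval):
        bases = [b for b in sequence[start:start + window].upper() if b in p]
        info = -sum(math.log(p[b]) for b in bases) / len(bases) if bases else 0.0
        results.append((start + 1, info))
    return results


if __name__ == "__main__":
    # The window holding the rarer bases G and T carries more information.
    for position, info in window_information("AAAAAAGT", 4, 4):
        print(position, round(info, 4))
```

A sequence with equal amounts of the four bases gives Log(4) for every window, so differences between windows only appear when the overall composition is biased.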
.para Negentropy or information is defined in the following way: let Pi be the probability of observing base i, where i = A, C, G or T; then the average information per base is I=-sum(Pi.Log(Pi)) (sum over all i). This routine calculates Pi by calculating the overall composition for the sequence and then plots I for windows of length defined by the user. .left margin1 @30. TX 4 @ Search for hairpin loops .LEFT MARGIN2 .para Used to find simple inverted repeats or potential hairpin loops. The loops are defined by a range of sizes for the loop and a minimum number of consecutive base pairs in the stem. The results can be presented graphically or listed. A-T, G-C and G-T basepairs are counted. .para Define the range of loop sizes and the minimum number of consecutive basepairs required. Choose between plotted or listed results. .para The loops found are plotted as blips on a horizontal line that represents the sequence; the heights of the blips are proportional to the number of basepairs in the stems. Note that only uninterrupted stems are found - i.e. all basepairs must be made. To look for stems with some unpaired bases (or for palindromes) use the inverted repeat motif class in the pattern searching option. .para Typical dialogue follows. .lit ? Menu or option number=30 Search for hairpin loops Define the range of loop sizes ? Minimum loop size (1-30) (1) = ? Maximum loop size (3-60) (3) = ? Minimum number of basepairs (2-20) (6) = ? (y/n) (y) Plot results n Searching T.G G-C G.T T.G C-G G-C T.G C-G G.T GCCGCA GCGGAGG 49 G G-C T.G C-G G.T T.G G-C CTGCTG GGAGGTC 56 G T.G G-C G.T T.G C-G G-C T-A T.G AGCGCA CGACTGA 139 A C G.T C-G G.T C-G C-G G-C TTCGCT CAACGCC 244 .end lit .LEFT MARGIN1 @31. TX 4 @ Search for long range inverted repeats .LEFT MARGIN2 .para Searches for inverted repeats. The repeats found are exact matches of at least 6 consecutive bases. Results can be presented graphically or listed.
Plotted results show the end points of repeats joined by rectangular lines. .para If dialogue is not requested the defaults will be taken. Otherwise choose between plotted or listed results. If required select to analyse a restricted segment of the currently active region. Choose a repeat length. .para Typical dialogue follows. .lit ? Menu or option number=D31 Plot long-range inverted repeats ? (y/n) (y) Plot results n Define restricted region ? start (1-1023) (1) = ? end (2-1023) (1023) = ? Minimum inverted repeat (6-30) (12) =10 Searching 27 909 10 TGCCCAGAGA .end lit .LEFT MARGIN1 @32. TX 4 @ Search for repeats .LEFT MARGIN2 .para Searches for direct repeats. The repeats found are exact matches of at least 6 consecutive bases. Results can be presented graphically or listed. Plotted results show the end points of repeats joined by rectangular lines. .para If dialogue is not requested the defaults will be taken. Otherwise choose between plotted or listed results. If required select to analyse a restricted segment of the currently active region. Choose a repeat length. .para Typical dialogue follows. .lit ? Menu or option number=D32 Plot repeats ? (y/n) (y) Plot results n Define restricted region ? start (1-1023) (1) = ? end (2-1023) (1023) = ? Minimum repeat (6-30) (12) =8 Searching 619 988 8 GCTGTTGT 514 646 8 GCTGCTAA 94 865 8 TCCGCTGG 146 222 9 GTGGCTGGC 455 497 8 TCGCCCTC 454 496 9 CTCGCCCTC 872 875 8 GCCGCCGC 510 615 8 CGTTGCTG 152 913 8 GGCAGCGA 199 265 8 CGTCGAGG 689 794 8 AGTTTGGG 147 223 8 TGGCTGGC 101 116 8 GACGAGGA 8 690 8 GTTTGGGC 52 141 8 TGCTGGTG .end lit .left margin1 @33. TX 4 @ Search for z dna (total ry, yr) .LEFT MARGIN2 .para Searches for segments of the sequence that might form Z DNA. A window length is chosen and the number of RY and YR dinucleotides within each window is plotted. The top of the box corresponds to all RY or YR, the bottom to zero RY or YR. .para If dialogue is requested, select a window length and plot interval. 
Otherwise the defaults will be used. .para The program contains three separate ways of doing this (options 33,34,35). .left margin1 @34. TX 4 @ Search for z dna (runs of ry, yr) .LEFT MARGIN2 .para Searches for segments of the sequence that might form Z DNA. Results are plotted. .para If dialogue is requested define a window length and plot interval. Otherwise the defaults will be used. The routine counts the number of R in positions 1,3,5 etc =R1, the number of Y in positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2 and the number of R in positions 2,4,6 etc =R2 for a window length. It plots the maximum of R1+Y1 and R2+Y2 relative to a minimum of (window length)/2 and a maximum of (window length). (see 33,35). .LEFT MARGIN1 @35. TX 4 @ Search for z dna (best phased value) .LEFT MARGIN2 .para Searches for segments of the sequence that might form Z DNA. Results are plotted. .para If dialogue is requested define a window length and a plot interval. Otherwise the default values will be used. .para The routine counts the number of consecutive RY or YR dinucleotides in phase. It moves through the sequence counting the number of RY or YR dinucleotides; when the next dinucleotide is not of the correct type the score is set back to zero and the search restarted using the current base to set the phase. The plots are done relative to a minimum of zero and a maximum defined by the user. (See 33,34). .LEFT MARGIN1 @36. TX 4 @ Local similarity or complementarity search .LEFT MARGIN2 .PARA This function is designed to find segments of local similarity or complementarity. It is therefore like performing a DIAGON plot that is restricted to regions near the main diagonal. Results can be presented graphically or listed. .para Users define a region to search through, a span length, a range for searching through and a cut-off score.
The program takes all sections of sequence of length span within the defined region and compares them to all other sequences within the region and range specified. If a match above the cutoff is found we need to show the position of the two sections of sequence and the score, and we do it in the following way. If we have a 70% match between a sequence that starts at p1 and a sequence that starts at p2 the program draws a diagonal line that starts at p1 with height 70% of the box and which finishes at p2 with height 0. The matches can also be listed. .para Here I define the terms range, region, and span and what is compared. Suppose we have a defined region j1 to j2, a range of i1 to i2 and a span of s; the program will take, in turn, all sections of sequence of length s within j1 and j2 and compare them to all sequences that start a distance i1+s-1 to i2+s-1 away from them. First it will take the sequence of length s starting at j1 and compare it with the sequence of length s starting at j1+s-1+i1, then j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the sequence of length s starting at j1+1 and compare it with the sequence starting at j1+s-1+1+i1 etc. This continues until we hit the right hand end of the sequence as defined by j2. Note 1)that sequences are not compared with themselves: the nearest sequence compared to a span s starting at j starts at j+s; 2) ranges i1 and i2 are ranges of start positions; 3) by choosing a range greater than the length of the sequence this routine will do a full DIAGON analysis except for those points within a distance span of the main diagonal (see note 1). .para Typical dialog follows. .lit ? Menu or option number=36 Search for local similarity or complementarity ? (y/n) (y) Find direct repeats ? (y/n) (y) Keep picture n ? Span (5-200) (15) = Define restricted region ? start (0-1023) (1) = ? end (2-1023) (1023) = ? Percent match (1.00-100.00) (70.00) = ? Range start (1-50) (1) = ? Range end (1-50) (1) =5 ? 
(y/n) (y) Plot results n Working 118 128 CGAGGAGGAG GTGGA ** ***** ** ** GGACGAGGAC GTCGA 100 110 119 129 GAGGAGGAGG TGGAT ** ***** * * ** GACGAGGACG TCGAC 101 111 ? (y/n) (y) Find direct repeats n ? (y/n) (y) Keep picture ? Span (5-200) (15) = Define restricted region ? start (0-1023) (1) = ? end (2-1023) (1023) = ? Percent match (1.00-100.00) (70.00) = ? Range start (1-50) (1) = ? Range end (1-50) (5) =8 ? (y/n) (y) List results Working 178 188 ACTCAGATCC GGCGG ***** *** * ** ACTCAAATCA GTCGC 156 166 177 187 CACTCAGATC CGGCG ***** *** * ** AACTCAAATC AGTCG 157 167 ? (y/n) (y) Find inverted repeats ! .end lit .left margin1 @37. TX 5 @ Set genetic code .LEFT MARGIN2 .para This function allows the user to change the current active genetic code for all the options. The user may select: the standard code, the mammalian mitochondrial code, the yeast mitochondrial code or a personal code (define your own). .para Select code. If personal, define a codon and select an amino acid. When all codons have been reset define a blank codon. .para The code differences are: .lit Mammalian Yeast Codon Mitochondrial Mitochondrial Standard UGA W W STOP AUA M M I CUA L T L AGA STOP R R AGG STOP R R .END LIT .para Typical dialogue follows. .lit ? Menu or option number=37 X 1 Standard code 2 Mammalian mitochondrial code 3 Yeast mitochondrial code 4 Personal code ? 0,1,2,3,4 =2 ? Menu or option number=37 X 1 Standard code 2 Mammalian mitochondrial code 3 Yeast mitochondrial code 4 Personal code ? 0,1,2,3,4 =4 Define genetic code by typing a codon followed by a 1 letter amino acid symbol ? Codon=TTT Default Amino acid symbol=F ? Amino acid symbol=W ? Codon= .end lit .left margin1 @38. T 3 4 @ Examine repeats .left margin2 .para This function can be used to examine the frequencies of repeated words within a sequence. It finds all words that occur more than once. 
The user selects a minimum word length and the program finds all words of that length that occur more than once; then it "follows" each repeated word until it becomes unique. For each word length it can report the number of different repeated words, the number of occurrences of each word, and their actual positions and sequences. .para It is possible that the algorithm may run out of memory, particularly if a short minimum word length is chosen, or if the sequence is very long or very repetitive. If this occurs the longest reported word length will not necessarily be the longest in the sequence: the memory will have been consumed before the longest word is found. .lit Typical dialogue and output is shown below. Expected length of longest repeat 14 ? Minumim word length (1-6) (6) =6 Working ? Show repeat frequencies for words of at least length (6-15) (15) =10 For length 10 the number of different repeated words is 2035 For length 11 the number of different repeated words is 613 For length 12 the number of different repeated words is 161 For length 13 the number of different repeated words is 37 For length 14 the number of different repeated words is 10 For length 15 the number of different repeated words is 1 ? Show repeats for words of length (6-15) (15) =14 ? Show repeats for words occuring with frequency (2-9999) (2) =2 ggtgctcatgccca occurs at 21611 occurs at 21851 ttatccggtgatga occurs at 4604 occurs at 8806 agcaccacgctgac occurs at 5954 occurs at 9486 catgacggaggatg occurs at 10480 occurs at 19925 aaagacgggaaaat occurs at 11820 occurs at 43157 tacaaaaccaattt occurs at 26797 occurs at 31369 cgagaaagagtgcg occurs at 4260 occurs at 44305 gccggatgatggcg occurs at 7893 occurs at 16638 atgacggaggatga occurs at 10481 occurs at 19926 gcggcgaacgaggc occurs at 11352 occurs at 18718 ? Show repeats for words of length (6-15) (15) =! Example of not enough memory ---------------------------- Expected length of longest repeat 14 ?
Minumim word length (1-6) (6) =1 Working Not enough memory Memory used in bytes 1125996. Length of longest repeat 5 ? Show repeat frequencies for words of at least length (1-5) (5) =! .end lit .left margin1 @39. TX 5 @ Translate and list in upto six phases .LEFT MARGIN2 .para This is a general listing function that will perform translations and produce several forms of output. The possibilities are: .lit 1) no translation, list one or two strands, two ways of numbering the sequence. 2) translation, one or two strands, one or three letter codes. Positions defined by: a) open reading frames of some minimum length l, l can be 0, hence giving a complete six phase translation. b) positions typed on keyboard, again 1 to 6 phases, translations appearing above and below the dna. c) positions read from a feature table. It should be used in preference to option 5. For publication without a translation, the option to number ends of lines is more compact than option 5. Some examples and typical dialogue are given below. Note the requirement for d39. ? Menu or option number=D39 Find open reading frames, translate and list ? (y/n) (y) Show translation The segments to translate can be 1 Typed on the keyboard 2 Read from a feature table X 3 Open reading frames ? 1,2,3 = ? Minimum open frame in amino acids (0-7238) (30) = ? (y/n) (y) Use 1 letter codes Define section of DNA to display ? start (1-7238) (1) = ? end (2-7238) (7238) =300 ? Line length (30-120) (60) = Which strands should be shown X 1 + strand only 2 - strand only 3 Both strands ? 1,2,3 =3 ? (y/n) (y) Number ends of lines N A T T I S R I D A T F S A R A P N E N AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60 . : . : . : . : . : . : TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA * S A G W I F I A V V I L L I S A V K E A R A G F S F I A K Q V I D H L R N V S N G Q T K S T L N R L L T I C E M Y L M V K L N L L ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120 . : . : . : . : . 
: . : TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA Y S F L N N V M Q S I Y R I T L S F R S I A L C T I S W K R F T D L P * V L D V R S Q N W E S T V T W N E T S R H R T L V R R I G N Q L L H G M K L P D T V L * CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180 . : . : . : . : . : . : GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT T R L I P F R E C F Q S D V T V H F S V E L C R V K V A Y L K H V E L Q H Q I Q Q L S S K P GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240 . : . : . : . : . : . : CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT T A Y K F C T S S C C W I S A K M T S Y Q K E Q L K V L S N P D L TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300 . : . : . : . : . : . : AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC ? Menu or option number=D39 Find open reading frames, translate and list ? (y/n) (y) Show translation N Define section of DNA to display ? start (1-7238) (1) = ? end (2-7238) (7238) =300 ? Line length (30-120) (60) = Which strands should be shown X 1 + strand only 2 - strand only 3 Both strands ? 1,2,3 = ? (y/n) (y) Number ends of lines AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60 ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120 CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180 GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240 TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300 ? Menu or option number=D39 Find open reading frames, translate and list ? (y/n) (y) Show translation The segments to translate can be 1 Typed on the keyboard 2 Read from a feature table X 3 Open reading frames ? 1,2,3 = ? Minimum open frame in amino acids (0-7238) (30) =0 ? (y/n) (y) Use 1 letter codes N Define section of DNA to display ? start (1-7238) (1) = ? end (2-7238) (7238) =300 ? Line length (30-120) (60) = Which strands should be shown X 1 + strand only 2 - strand only 3 Both strands ? 
1,2,3 =3 ? (y/n) (y) Number ends of lines AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60 . : . : . : . : . : . : TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr ***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120 . : . : . : . : . : . : TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu*** SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180 . : . : . : . : . : . : GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer*** AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240 . : . : . : . : . : . 
: CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300 . : . : . : . : . : . : AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg ? Menu or option number=D39 Find open reading frames, translate and list ? (y/n) (y) Show translation The segments to translate can be 1 Typed on the keyboard 2 Read from a feature table X 3 Open reading frames ? 1,2,3 =1 ? (y/n) (y) Use 1 letter codes Define section of DNA to display ? start (1-7238) (1) = ? end (2-7238) (7238) =300 ? Line length (30-120) (60) = Which strands should be shown X 1 + strand only 2 - strand only 3 Both strands ? 1,2,3 = ? (y/n) (y) Number ends of lines N Translate ? From (0-300) (0) =101 ? To (1-300) (300) =300 Translate ? From (0-300) (0) =102 ? To (1-300) (300) =200 Translate ? 
From (0-300) (0) = AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 10 20 30 40 50 60 M V K L N L L W S N * I Y ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 70 80 90 100 110 120 V R R I G N Q L L H G M K L P D T V L * S F A E L G I N C Y M E * N F Q T P Y F CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 130 140 150 160 170 180 L H I * N M L S Y S T R F S N * A L S H S C I F K T C GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 190 200 210 220 230 240 P Q K * P L I K R S N * R Y S L I L T C TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 250 260 270 280 290 300 ? Menu or option number=D39 Find open reading frames, translate and list ? (y/n) (y) Show translation The segments to translate can be 1 Typed on the keyboard 2 Read from a feature table X 3 Open reading frames ? 1,2,3 =2 ? Embl feature table file=1.FT ? (y/n) (y) Use 1 letter codes Define section of DNA to display ? start (1-7238) (1) = ? end (2-7238) (7238) =300 ? Line length (30-120) (60) = Which strands should be shown X 1 + strand only 2 - strand only 3 Both strands ? 1,2,3 =3 ? (y/n) (y) Number ends of lines N A T T I S R I D A T F S A R A P N E N AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60 . : . : . : . : . : . : TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA * S A G W I F I A V V I L L I S A V K E A R A G F S F I A K Q V I D H L R N V S N G Q T K S T L N R L L T I C E M Y L M V K L N L L ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120 . : . : . : . : . : . : TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA Y S F L N N V M Q S I Y R I T L S F R S I A L C T I S W K R F T D L P * V L D V R S Q N W E S T V T W N E T S R H R T L V R R I G N Q L L H G M K L P D T V L * CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180 . : . : . : . : . : . 
: GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT T R L I P F R E C F Q S D V T V H F S V E L C R V K V A Y L K H V E L Q H Q I Q Q L S S K P GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240 . : . : . : . : . : . : CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT T A Y K F C T S S C C W I S A K M T S Y Q K E Q L K V L S N P D L TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300 . : . : . : . : . : . : AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC * L Y E R I R V Q * F S C N F T S E L G S R .end lit .left margin1 @40. TX 5 @ Translate and write the protein sequence to disk .LEFT MARGIN2 .para This routine allows the user to translate sections of the sequence into the 1 letter amino acid codes and store the resulting amino acid sequences in a disk file. Two modes of use are possible. Either all open reading frames of at least some minimum length will automatically be found and translated, or the user can specify that particular segments be translated. .para Mode 1: the user selects to translate all open reading frames. .para Either, or both, strands can be translated. The output file is in the same format as a PIR .seq file. Each protein segment is given an entry name that is its start base in the DNA, and a title that includes its end position, reading frame and strand (+ for plus, - for minus). Each segment is terminated by * whether or not there is a stop codon in the DNA. The file is therefore suitable for input to FASTA, ALIGNL and ANALYSEPL. .para Mode 2: the user selects to identify the segments to translate. .para Either, or both, strands can be translated. If multiple coding regions are translated each will be separated from the previous one by a gap of 5 dashes (-----). The sections to translate can be defined from the keyboard or by supplying the name of the appropriate EMBL library feature table. .para Typical dialogue follows. .lit ?
Menu or option number=40 Translate and write protein sequence to disk ? (y/n) (y) Translate selected regions ? (y/n) (y) Define segments using keyboard Translate ? From (0-1023) (0) =1 ? To (1-1023) (1023) =111 ? (y/n) (y) + strand Translate ? From (0-1023) (0) = ? Output file name=1.OUT ? Menu or option number=40 Translate and write protein sequence to disk ? (y/n) (y) Translate selected regions n ? Minimum open frame in amino acids (5-1000) (30) = X 1 + strand only 2 - strand only 3 Both strands ? 0,1,2,3 =3 ? File name for translation=1.OUT ? Menu or option number=6 Page through text files ? Name of file to read=1.OUT >P1; 25 135 1 + GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG* >P1; 238 486 1 + IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR VDAGADFAAVRMLALFGAVVPH* >P1; 556 795 1 + SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK GAGRRDQWRGGGVDLANGR* >P1; 796 987 1 + FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV LRG* >P1; 2 163 2 + NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD* >P1; 176 391 2 + PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG YTFTFRLWEVT* >P1; 476 628 2 + SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL* >P1; 629 958 2 + MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR* >P1; 3 293 3 + IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG* >P1; 411 521 3 + CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC* >P1; 1020 37 1 - EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS SSSSTSSSSGSLGPLTTSAPPAPAAAAQ* >P1; 373 -1 2 - AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS PNRI* >P1; 517 
407 2 - QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH* >P1; 649 518 2 - QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES* >P1; 853 650 2 - HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER RCKTGCV* >P1; 958 854 2 - SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT* >P1; 179 78 3 - VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR* >P1; 479 363 3 - GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM* >P1; 791 666 3 - RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM* >P1; 1022 819 3 - SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS EWRLVERR* End of file .end lit .LEFT MARGIN1 @41. TX 5 @ Calculate and write codon table to disk .LEFT MARGIN2 .para This routine calculates codon usage tables for sections of the sequence and stores the resulting tables on disk. The sections to translate can be defined from the keyboard or by supplying the name of the appropriate EMBL library feature table. .para If required users can add to an existing codon table stored as a disk file. Choose between storing observed counts or having them normalised so that the totals for each amino acid sum to 100. Select between defining segments at the keyboard or using an EMBL feature table. Define segments. Signal completion with a zero start. Supply a file name. For each segment the program will display the counts, at the end it will display the accumulated totals. .lit Typical dialogue follows. ? Menu or option number=41 Calculate and write codon table to disk ? (y/n) (y) Start with empty table ? (y/n) (y) Show observed counts ? (y/n) (y) Define segments using keyboard ? Count from (0-1023) (0) =1 ? Count to (1-1023) (1023) =111 ? (y/n) (y) + strand =========================================== F TTT 0. S TCT 0. Y TAT 0. C TGT 0. F TTC 1. S TCC 1. Y TAC 0. C TGC 3. L TTA 1. S TCA 0. * TAA 0. * TGA 1. L TTG 2. S TCG 0. * TAG 0. W TGG 2. =========================================== L CTT 0. P CCT 0. H CAT 0. R CGT 2. L CTC 0. P CCC 0. H CAC 0. R CGC 2. L CTA 0. P CCA 0. Q CAA 1. R CGA 1. L CTG 1. P CCG 0. Q CAG 2. R CGG 2. 
=========================================== I ATT 0. T ACT 0. N AAT 0. S AGT 0. I ATC 0. T ACC 1. N AAC 0. S AGC 1. I ATA 0. T ACA 0. K AAA 0. R AGA 1. M ATG 0. T ACG 0. K AAG 0. R AGG 0. =========================================== V GTT 0. A GCT 1. D GAT 0. G GGT 3. V GTC 0. A GCC 1. D GAC 0. G GGC 1. V GTA 0. A GCA 0. E GAA 1. G GGA 4. V GTG 1. A GCG 0. E GAG 0. G GGG 0. =========================================== ? Count from (0-1023) (0) = Codon totals over all genes =========================================== F TTT 0. S TCT 0. Y TAT 0. C TGT 0. F TTC 1. S TCC 1. Y TAC 0. C TGC 3. L TTA 1. S TCA 0. * TAA 0. * TGA 1. L TTG 2. S TCG 0. * TAG 0. W TGG 2. =========================================== L CTT 0. P CCT 0. H CAT 0. R CGT 2. L CTC 0. P CCC 0. H CAC 0. R CGC 2. L CTA 0. P CCA 0. Q CAA 1. R CGA 1. L CTG 1. P CCG 0. Q CAG 2. R CGG 2. =========================================== I ATT 0. T ACT 0. N AAT 0. S AGT 0. I ATC 0. T ACC 1. N AAC 0. S AGC 1. I ATA 0. T ACA 0. K AAA 0. R AGA 1. M ATG 0. T ACG 0. K AAG 0. R AGG 0. =========================================== V GTT 0. A GCT 1. D GAT 0. G GGT 3. V GTC 0. A GCC 1. D GAC 0. G GGC 1. V GTA 0. A GCA 0. E GAA 1. G GGA 4. V GTG 1. A GCG 0. E GAG 0. G GGG 0. =========================================== ? (y/n) (y) Save table in a file n .end lit .left margin1 @42. TX 6 @ Codon usage method .LEFT MARGIN2 .para Used to find protein coding regions. For each window length of the sequence the routine measures the closeness to an expected codon usage. Results are plotted for each of the three reading frames. Stop and start codons are also marked on the plots. Has the highest resolution of all such methods, but makes the strongest assumption, i.e. that the codon usage is known. The latest version is described in Methods in Enzymology 183, 193-211. .para Choose whether to use an internal standard (i.e. part of the current sequence known to code for a protein). If so define its end points, and those of any others. 
Otherwise supply the name of a disk file containing a table of codon usage. Tables are listed. Choose between using the observed counts, or two types of normalisation: normalised to give an average amino acid composition; normalised to no amino acid bias. The first normalisation is often sensible, but the second removes valuable information and is only made available for special circumstances. The final table will be displayed, followed by the expected scores for window lengths 21, 31 and 41 codons. The scores for each of the three reading frames are shown (they are logarithmic values) to help users choose a window length for the analysis. Define a window length and plot interval. Plotting will start. .para The method was first described in Staden and McLachlan Nucl. Acid Res. 10 141-156 (1982) and the following is a summary of the initial ideas. The method makes the following main assumptions: the codon preferences of all the genes in the sequence we are examining are similar to those of the standard; the sequence is coding throughout its whole length in only one reading frame; in the coding frame the frequency of codon abc has a definite value Fabc. .LEFT MARGIN2 If we select a sequence a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1 then the probability of selecting it in each of the three frames is: .left margin15 frame 1: p1=Fa1b1c1.Fa2b2c2...Fanbncn .left margin15 frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1 .left margin15 frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1 .LEFT MARGIN2 The probability that selection of a particular sequence was "caused" by it being a coding sequence is: .LEFT MARGIN2 P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3). .LEFT MARGIN2 The program calculates these values for the given window length but plots Log(P/(1-P)) for each of the three frames. At each position along the sequence for which the program has a point to plot, it finds which of the three values is highest and places a single point at the 50% level for the corresponding frame.
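.para The frame probabilities defined above can be sketched in a few lines of Python. This is an illustration only, not the program's own code: the function name frame_log_odds and the dictionary freq (codon to frequency Fabc) are hypothetical, and every codon frequency is assumed to be greater than zero, so a table containing zero counts would need adjusting first. Logarithms are used so that long windows do not underflow.
.lit
```python
import math


def frame_log_odds(window, freq):
    """Return ([P1, P2, P3], [Log(P/(1-P)) per frame]) for one window.

    window : DNA string of length 3n+2, so n codons can be read in each frame
    freq   : hypothetical dict mapping each codon to its frequency Fabc (> 0)
    """
    n = (len(window) - 2) // 3

    # log p for each frame: sum of log F over the n codons read in that frame
    log_p = []
    for frame in range(3):
        codons = (window[frame + 3 * k: frame + 3 * k + 3] for k in range(n))
        log_p.append(sum(math.log(freq[c]) for c in codons))

    def log_sum(values):
        # log(sum(exp(v))) computed without overflow or underflow
        top = max(values)
        return top + math.log(sum(math.exp(v - top) for v in values))

    total = log_sum(log_p)
    probs = [math.exp(lp - total) for lp in log_p]  # P = p/(p1+p2+p3)

    # Log(P/(1-P)) equals log p minus log(sum of the other two p)
    odds = [lp - log_sum(log_p[:i] + log_p[i + 1:])
            for i, lp in enumerate(log_p)]
    return probs, odds
```
.end lit
.para For a window of 3n+2 bases the function returns P1, P2 and P3 together with Log(P/(1-P)) for each frame. The frame with the highest value is the one that receives the single point at the 50% level.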
These single points will join to form a solid line if one frame is consistently the highest scoring. In addition stop codons are shown as short vertical lines that bisect the 50% level of probability. When looking for coding regions the user should look for solid horizontal lines at the 50% level that are not interrupted by these short vertical lines. .para Changes. Two normalisations are offered: 1) to remove all amino acid compositional components from the tables, hence leaving only the codon preference components. In general this is not recommended as the amino acid component alone is often sufficient to choose correctly between frames, but may be useful in special circumstances. 2) to change the amino acid composition components to give an average amino acid composition rather than the one contained in the standard (this leaves the codon preference components unchanged). In general this should be useful as the average amino acid composition is likely to be closer to the composition of the genes being hunted than is that of the standard table of codon preferences. The average composition is that recently published by Argos, not the Dayhoff one that we have used before. .para Typical dialogue follows. .lit ? Menu or option number=42 Staden and McLachlan codon usage method Codon tables for standards may be read from disk or calculated from parts of the current sequence ? (y/n) (y) Define internal standard Define standard ? start (0-1023) (0) =1 ? end (2-1023) (1023) =1000 =========================================== F TTT 13. S TCT 1. Y TAT 1. C TGT 3. F TTC 4. S TCC 10. Y TAC 1. C TGC 7. L TTA 1. S TCA 0. * TAA 1. * TGA 4. L TTG 4. S TCG 1. * TAG 3. W TGG 5. =========================================== L CTT 9. P CCT 1. H CAT 3. R CGT 14. L CTC 7. P CCC 0. H CAC 7. R CGC 14. L CTA 0. P CCA 0. Q CAA 4. R CGA 9. L CTG 12. P CCG 1. Q CAG 9. R CGG 8. =========================================== I ATT 7. T ACT 4. N AAT 4. S AGT 1. I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2. M ATG 2. T ACG 1. K AAG 2. R AGG 2. =========================================== V GTT 11. A GCT 13. D GAT 6. G GGT 9. V GTC 5. A GCC 10. D GAC 9. G GGC 11. V GTA 6. A GCA 5. E GAA 6. G GGA 12. V GTG 8. A GCG 5. E GAG 3. G GGG 8. =========================================== Define standard ? start (0-1023) (0) = Total codons in standard= 333. X 1 Use observed frequencies 2 Normalize to average amino acid composition 3 Normalize to no amino acid bias ? 0,1,2,3 =2 =========================================== F TTT 19. S TCT 2. Y TAT 10. C TGT 3. F TTC 6. S TCC 22. Y TAC 10. C TGC 8. L TTA 2. S TCA 0. * TAA 0. * TGA 0. L TTG 7. S TCG 2. * TAG 0. W TGG 8. =========================================== L CTT 16. P CCT 16. H CAT 4. R CGT 10. L CTC 12. P CCC 0. H CAC 10. R CGC 10. L CTA 0. P CCA 0. Q CAA 8. R CGA 7. L CTG 21. P CCG 16. Q CAG 18. R CGG 6. =========================================== I ATT 19. T ACT 13. N AAT 16. S AGT 2. I ATC 11. T ACC 17. N AAC 12. S AGC 15. I ATA 3. T ACA 3. K AAA 22. R AGA 1. M ATG 15. T ACG 3. K AAG 15. R AGG 1. =========================================== V GTT 15. A GCT 21. D GAT 14. G GGT 10. V GTC 7. A GCC 16. D GAC 20. G GGC 13. V GTA 8. A GCA 8. E GAA 26. G GGA 14. V GTG 11. A GCG 8. E GAG 13. G GGG 9. =========================================== Span length 21 expected mean values: 4.8 -5.7 -4.8 Span length 31 expected mean values: 7.1 -8.4 -7.2 Span length 41 expected mean values: 9.5 -11.1 -9.5 ? odd span length (11-101) (25) =41 ? plot interval (1-11) (5) = Missing graphics display here .end lit .left margin1 @43. TX 6 @ Positional base preference method. .LEFT MARGIN2 .para Used to find protein coding regions. For each window length of the sequence the routine measures the closeness to an expected pattern of base frequencies . Results are plotted for each of the three reading frames. Stop and start codons are also marked on the plots. 
The method is particularly useful for showing which reading frame is the most likely to be coding. The latest version is described in a forthcoming issue of Methods in Enzymology, but the original ideas were given in Staden, R. Nucl. Acid Res. 12 551-567 (1984). .para If dialogue is requested the following inputs are needed, otherwise the standard analysis is performed. Choose between a "global" standard, or a selected one. If the global standard is selected the expected scores are displayed and the user asked to define a span length and a plot interval. Then users choose between plotting relative or absolute scores, and can reset the scaling values employed for plotting. If the global standard is not selected users must define a region of the sequence to use as a standard, or they can read in a codon table from which the program will calculate one. Then they can either use the values observed in this standard, or combine its values for the third positions in codons with those from the global standard. Next they can give different weightings to each of the three positions in codons. .para In its original form the method took advantage of the uneven use of amino acids by proteins and the structure of the genetic code table and assumed that there is a typical ("global") amino acid composition and no codon preference. The typical amino acid composition is the average composition found by Argos (see below). This composition and no codon preference determines the frequency of each of the four bases in each of the three codon positions. This 3x4 frequency table shows unequal use of the bases and in particular a marked use of G in position 1 and of A in position 2 (at the expense of G). The routine slides a window along the sequence and calculates a score for each of the three reading frames at each window position. It assumes the sequence is coding throughout its whole length and calculates the probability that it is coding in each of the three frames.
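.para The sliding-window scoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's code: the names frame_scores, relative_scores and scan are hypothetical; the score for a frame is assumed to be the weighted sum of the table frequencies of the bases observed at codon positions 1, 2 and 3; and the 3x4 table is the one printed for an internal standard in the typical dialogue for this option, not the global standard.
.lit
```python
# Column order of the hypothetical 3x4 table: T, C, A, G.
BASES = {"T": 0, "C": 1, "A": 2, "G": 3}

# Rows are codon positions 1, 2, 3 (values copied from the typical dialogue).
TABLE = [
    [0.125, 0.249, 0.230, 0.397],
    [0.298, 0.245, 0.292, 0.165],
    [0.288, 0.313, 0.169, 0.230],
]


def frame_scores(window, offset=0, table=TABLE, weights=(1.0, 1.0, 1.0)):
    """Absolute score of one window in each of the three reading frames.

    offset is the 0-based position of the window's first base in the full
    sequence, so that frames are counted from the start of the sequence.
    """
    scores = [0.0, 0.0, 0.0]
    for frame in range(3):
        for i, base in enumerate(window.upper()):
            if base not in BASES:
                continue  # assumption: ambiguity codes contribute nothing
            pos = (offset + i - frame) % 3  # codon position in this frame
            scores[frame] += weights[pos] * table[pos][BASES[base]]
    return scores


def relative_scores(scores):
    """Relative values p/(p1+p2+p3); assumes the scores do not sum to zero."""
    total = sum(scores)
    return [s / total for s in scores]


def scan(sequence, span, step=1):
    """Yield (start, absolute scores) for each window of length span."""
    for start in range(0, len(sequence) - span + 1, step):
        yield start, frame_scores(sequence[start:start + span], offset=start)
```
.end lit
.para Calling scan on a sequence gives one triple of absolute scores per window, with frame 1 taken to start at the first base of the sequence.
.para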
When tested against all the E. coli sequences in the EMBL sequence library it correctly identified the coding frame for 91% of window positions. (The E. coli sequences were chosen only for technical reasons: I have no reason to think the method would work less well on other organisms with roughly even base composition.) The routine can plot either absolute or relative values: the absolute values are those found by summing the scores for each frame (say p1, p2 and p3), and the relative values are then p1/(p1+p2+p3), p2/(p1+p2+p3) and p3/(p1+p2+p3). .para At each position for which a point is plotted the program finds which of the three values is highest and places a single point at the 50% level for the corresponding frame. These single points will join to form a solid line if one frame is consistently the highest scoring. In addition stop codons are shown as short vertical lines that bisect the 50% level of probability. When looking for coding regions the user should look for solid horizontal lines at the 50% level that are not interrupted by these short vertical lines. The absolute mean values expected on the complement of the coding strand (and in the same frame) are 5% lower than those on the coding strand but the relative values are the same on both strands. Therefore, although the relative values give smoother plots and tend to emphasize the coding frame, they cannot be used to decide which strand is coding. The absolute values plot should be used for this purpose, bearing in mind that the differences between strands are quite small. .para The method has been improved in two overall ways: first it now allows users to define their own typical amino acid composition by selecting a standard sequence from within the sequence they are analysing or from a codon table; secondly it allows the inclusion of third position preferences. Again these third position preferences are defined by the use of an internal standard sequence.
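.para
As a rough illustration of the ideas above, the following Python sketch builds a 3x4 positional base frequency table from a standard region and then gives absolute and relative scores for the three frames of one window. The function names are invented for this example, and the per-base score (the table frequency itself, summed over the window) is an assumption: the text says only that scores are summed for each frame, so this is not the program's actual scoring function.
.lit
```python
from collections import Counter

BASES = "TCAG"


def positional_table(standard):
    """Frequency of each base at codon positions 1, 2 and 3 of a standard region."""
    table = []
    for pos in range(3):
        counts = Counter(standard[pos::3])
        total = sum(counts[b] for b in BASES) or 1  # avoid dividing by zero
        table.append({b: counts[b] / total for b in BASES})
    return table


def frame_scores(window, table):
    """Absolute score for frames 1-3 (assumed scoring: sum of table frequencies)."""
    scores = []
    for frame in range(3):
        score = 0.0
        for i, base in enumerate(window):
            if base in BASES:
                # (i - frame) % 3 is the codon position of base i in this frame
                score += table[(i - frame) % 3][base]
        scores.append(score)
    return scores


def relative_scores(scores):
    """p1/(p1+p2+p3), p2/(p1+p2+p3) and p3/(p1+p2+p3)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / 3.0] * 3
    return [s / total for s in scores]


table = positional_table("GCTGATGGT")
absolute = frame_scores("GCTGATGGT", table)
relative = relative_scores(absolute)
```
.end lit
Here the window is the standard itself, so frame 1 scores highest. With a real sequence the window would be moved along in steps of the plot interval.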
Not only can users define their own standards but they can also give weights to each of the three positions in codons. This allows different emphasis to be used for each of the three positions. For example, by giving in turn weights of 1.0, 0.0, 0.0, then 0.0, 1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate contribution made by each of the three positions. It is also possible to use the third position preferences with the values for the first two positions taken from the "global" amino acid composition. In all cases users may choose to plot absolute or relative values. The expected scores are displayed before each analysis and scales are drawn on the plots. At present this method does not give probabilities of coding; it has only been tested for its ability to choose the correct reading frame (see above). It could be used to give probabilities of coding if it were applied to all known coding and non-coding sequences in the way that the uneven positional base frequencies method was. It is designed to be used in conjunction with this method. Note that the average amino acid composition used to derive the base frequencies was changed on 17-11-1988, to be the new average given by McCaldon and Argos in Proteins 4 99-122 (1988). A further change is to allow users to select their own scales for producing the plots. This can be helpful if they want to emphasise or diminish certain features. .para Typical dialogue follows. .lit ? Menu or option number=D43 Positional base preferences method to find protein genes Select standard source X 1 Use global standard 2 Use internal standard 3 Use codon usage table ? Selection (1-3) (1) =2 Define region for standard ? start (0-8134) (0) =3171 ? end (3172-8134) (8134) =4700 Select normalisation X 1 Use observed frequencies 2 Combine with global standard ? Selection (1-2) (1) =1 T C A G Range 1 0.125 0.249 0.230 0.397 0.272 2 0.298 0.245 0.292 0.165 0.132 3 0.288 0.313 0.169 0.230 0.144 ?
(y/n) (y) Use 1.0 for positional weights Give weights between 0.0 and 1.0 to each of the 3 codon positions ? Position 1 (0.00-1.00) (1.00) = ? Position 2 (0.00-1.00) (1.00) = ? Position 3 (0.00-1.00) (1.00) = Expected scores per codon in each frame 0.136 0.122 0.123 ? odd span length (31-101) (67) = ? plot interval (1-11) (5) = ? (y/n) (y) Plot relative scores Scaling values: Minimum maximum range 0.3121 0.3656 0.0382 ? (y/n) (y) Leave scaling values unchanged Graphics not shown ? Menu or option number=D43 Positional base preferences method to find protein genes Select standard source X 1 Use global standard 2 Use internal standard 3 Use codon usage table ? Selection (1-3) (1) =3 ? File name of standard=atpase.cods =========================================== F TTT 21. S TCT 33. Y TAT 15. C TGT 5. F TTC 55. S TCC 40. Y TAC 40. C TGC 4. L TTA 8. S TCA 7. * TAA 8. * TGA 0. L TTG 19. S TCG 12. * TAG 1. W TGG 17. =========================================== L CTT 22. P CCT 17. H CAT 6. R CGT 73. L CTC 21. P CCC 4. H CAC 30. R CGC 23. L CTA 1. P CCA 10. Q CAA 19. R CGA 5. L CTG 168. P CCG 48. Q CAG 80. R CGG 3. =========================================== I ATT 47. T ACT 14. N AAT 17. S AGT 8. I ATC 98. T ACC 54. N AAC 52. S AGC 26. I ATA 6. T ACA 7. K AAA 85. R AGA 0. M ATG 75. T ACG 13. K AAG 28. R AGG 0. =========================================== V GTT 67. A GCT 56. D GAT 41. G GGT 90. V GTC 29. A GCC 53. D GAC 66. G GGC 66. V GTA 49. A GCA 59. E GAA 101. G GGA 5. V GTG 57. A GCG 64. E GAG 41. G GGG 8. =========================================== Select normalisation X 1 Use observed frequencies 2 Combine with global standard ? Selection (1-2) (1) =2 T C A G Range 1 0.177 0.211 0.277 0.336 0.159 2 0.271 0.238 0.310 0.182 0.128 3 0.242 0.301 0.168 0.289 0.132 ? (y/n) (y) Use 1.0 for positional weights Expected scores per codon in each frame 0.785 0.736 0.736 ? odd span length (31-101) (67) = ? plot interval (1-11) (5) = ? 
(y/n) (y) Plot relative scores Scaling values: Minimum maximum range 0.3219 0.3519 0.0214 ? (y/n) (y) Leave scaling values unchanged Graphics not shown .end lit .left margIN1 @44. TX 6 @ Uneven positional base frequencies. .LEFT MARGIN2 .para Used to find regions of a sequence that might be coding for a protein. The method looks for sections of the sequence in which the frequency at which each of the four bases occupies the three positions in codons is nonrandom. The level of nonrandomness is plotted on a scale that shows the probability that the sequence is coding. At each position along a sequence the calculation gives the same value for all six possible reading frames, so only one value is plotted. .para Define the window length and plot interval. .para The results are plotted in a box divided by a horizontal line marked "76%". 76% of coding regions achieve values above this line and 76% of noncoding regions achieve scores below the line. .para This method, first described in Staden R. Nucl. Acids Res. 12 551-567 1984, looks for uneven positional usage of bases in codons. It looks through the sequence in one fixed phase and counts the number of times each base appears in each of the three codon positions: for each window position it counts A1,A2,A3 and C1,C2,C3 and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and similarly CMEAN, GMEAN and TMEAN; it then calculates ADIF=abs(A1-AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly CDIF, GDIF and TDIF to measure the differences between an even base usage for all positions in the codons and the observed usage. The routine then calculates the sum ADIF+CDIF+GDIF+TDIF and plots this value on the following scale: the base level is such that no known window in a coding region has a lower value, whereas 14% of windows in noncoding sequences score below it. The top of the scale is not achieved by any known noncoding region, but is reached by 16% of known coding regions.
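.para
The calculation of ADIF+CDIF+GDIF+TDIF described above can be sketched in Python as follows. The function names are invented for this illustration, and the conversion of the raw sum to the plotted scale is omitted because it depends on empirical levels that are not listed here.
.lit
```python
def uneven_positional_score(window):
    """Return ADIF + CDIF + GDIF + TDIF for one window of sequence."""
    total = 0.0
    for base in "ACGT":
        # counts of this base in codon positions 1, 2 and 3 (e.g. A1, A2, A3)
        counts = [window[pos::3].count(base) for pos in range(3)]
        mean = sum(counts) / 3  # e.g. AMEAN
        total += sum(abs(c - mean) for c in counts)  # e.g. ADIF
    return total


def scan(seq, span, interval=1):
    """Score every window of length span, moving along by interval."""
    return [(i + 1, uneven_positional_score(seq[i:i + span]))
            for i in range(0, len(seq) - span + 1, interval)]


even = uneven_positional_score("AAAAAAAAA")    # every position identical
uneven = uneven_positional_score("ACGACGACG")  # each base tied to one position
```
.end lit
A window in which every base is used equally in all three codon positions scores zero; the more a base is confined to one position, the higher the score.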
The bar drawn across the plot corresponds to a level that is exceeded by 76% of windows in known coding regions but is reached by only 24% of windows in known noncoding regions, i.e. 76% of coding windows score above and 76% of noncoding windows score below. This is similar to Fickett's method but without the probabilities and weightings from the Los Alamos sequence library: it is therefore unbiased but may well give very similar results. .left margin1 @45. TX 6 @ Codon improbability on base composition .LEFT MARGIN2 .para Used to find regions of a sequence that might code for a protein. .para If dialogue is requested define a window length and plot interval. .para The idea of the method is that, of all the sequence features that we know, only coding regions will give rise to codon biases well above those expected from the base composition. If a region of sequence shows sufficiently strong codon bias then we conclude that it is coding for a protein. Using the multinomial distribution we have derived a function to measure the improbability of observing a set of codons from a sequence of the given composition. Using the Poisson distribution we have worked out the distribution of the improbability. The program plots the observed improbability minus the expected improbability (the mean as calculated from the Poisson distribution). The plots are presented against a scale of units of standard deviation as measured from the Poisson distribution. As with the other Staden and McLachlan method the program puts an extra point at a fixed level for the highest of the three probabilities; for this function this point is placed at six standard deviations above the mean expected level. The top of each plot corresponds to 12 standard deviations above the expected level and the bottom corresponds to the expected value.
.para Analysis of the application of the method to the EMBL sequence library indicates that the method does work for most sequences and that the levels of improbability roughly correlate with levels of expression. Coding regions will show high peaks in all three frames, making interpretation more difficult than for some of the other methods. .left margin1 @46. TX 6 @ Codon improbability on amino acid composition .LEFT MARGIN2 .para Used to find regions of a sequence that might code for a protein. .para If dialogue is requested define a window length and a plot interval. .para The idea of the method is that, of all the sequence features that we know, only coding regions will give rise to codon biases such that, for each amino acid, some codons are used far more frequently than others. The method is independent of what the bias actually is, requiring only that it is present. If a region of sequence shows sufficiently strong codon bias then we conclude that it is coding for a protein. Using the multinomial distribution we have derived a function to measure the improbability of observing a set of codons from a sequence of the given composition. Using the Poisson distribution we have worked out the distribution of the improbability. The program plots the observed improbability minus the expected improbability (the mean as calculated from the Poisson distribution). The plots are presented against a scale of units of standard deviation as measured from the Poisson distribution. As with the other Staden and McLachlan method the program puts an extra point at a fixed level for the highest of the three probabilities; for this function this point is placed at six standard deviations above the mean expected level. The top of each plot corresponds to 12 standard deviations above the expected level and the bottom corresponds to the expected value. .left margin1 @47.
TX 6 @ Shepherd RNY preference method .LEFT MARGIN2 .para Used to find regions of a sequence that might code for a protein. Based on the method of Shepherd (PNAS 78 1596-1600, 1981). .para If dialogue is requested define a window length and plot interval. .para Shepherd has found that many genes have a preference for the use of codons of the form RNY where R=purine, Y=pyrimidine and N=any base. He attributes this to remnants of a primitive genetic code. The calculation is similar to that for the Staden and McLachlan method, with p1 being simply the number of RNY codons found in frame 1, and so on, and the relative values being p1/(p1+p2+p3) etc. .left margIN1 @48. TX 6 @ Ficketts method .LEFT MARGIN2 .para Used to find regions of a sequence that might code for a protein. Based on the method of Fickett (Nucl. Acids Res. 10 1982), but plots values for fixed window lengths rather than over the whole of open reading frames. .para If dialogue is requested define a window length and plot interval. The results are plotted in a box divided into three horizontal strips. .para Sections of the sequence with values plotted in the top strip of the box are adjudged to be coding, those in the middle strip "no decision", and those in the bottom "not coding". .para The program performs the following calculations: let A1 = the number of occurrences of base A in position 1 of codons, A2 for position 2 etc. Similarly for bases C,G and T. For each window position calculate Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly for C,G and T to give 4 positional values. Also count the base composition for the window to give Acomp, Ccomp etc. Fickett tested each of these 8 parameters singly as to their ability to distinguish coding from noncoding regions and arrived at probabilities of coding for the range of values each can take = Pcod. He also measured their relative abilities and gave weightings to each of the 8 parameters = Pw.
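.para
The eight parameters for one window can be sketched in Python as below. The sketch reads the positional value as max/(min+1), expresses composition as a fraction of the window length, and uses an invented function name; all three are assumptions made for illustration. The Pcod and Pw tables from Fickett's paper are not reproduced here, so the sketch stops short of the final score.
.lit
```python
def fickett_parameters(window):
    """Return the 4 positional and 4 compositional values for one window."""
    params = {}
    for base in "ACGT":
        # e.g. A1, A2, A3: counts of the base in codon positions 1, 2 and 3
        counts = [window[pos::3].count(base) for pos in range(3)]
        # assumed reading of the formula: max divided by (min + 1)
        params[base + "pos"] = max(counts) / (min(counts) + 1)
        # assumed convention: composition as a fraction of the window
        params[base + "comp"] = window.count(base) / len(window)
    return params


values = fickett_parameters("ATGATGATG")
```
.end lit
Each of the eight values would then be converted to a probability by table lookup and weighted, as the remainder of this section explains.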
To calculate the "TESTCODE" for a window we first look up the Pcod for each of the calculated compositional and positional values and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE is plotted relative to three levels of decision: the top division="coding", the middle division="no decision" and the bottom division="not coding". .left margin1 @49. TX 6 @ tRNA gene search. .LEFT MARGIN2 .para Used to find segments of a sequence that might code for tRNAs. Looks for potential cloverleaf forming structures and then for the presence of expected conserved bases. Presents results graphically or draws out the cloverleafs. .para If dialogue is requested a large number of parameters need to be given values, including some loop lengths, scores for each of the four stems, and scores for the conserved bases. .para The program was first described in Staden Nucl. Acids Res. 817-825 (1980). The tRNAs that have been sequenced so far have two characteristics that can be used to locate their genes within long DNA sequences. Firstly they have a common secondary structure - the cloverleaf - and secondly, particular bases almost always appear at certain positions in the cloverleaf. The cloverleaf is composed of four base-paired stems and four loops. Three of the stems are of fixed length but the fourth, the dhu stem which usually has four base pairs, sometimes has only three. All of the loops can vary in size. The following relationships between the stems in the cloverleaf are assumed in the program: (a) there are no bases between one end of the aminoacyl stem and the adjoining tuc stem; (b) there are two bases between the aminoacyl stem and the dhu stem; (c) there is one base between the dhu stem and the anticodon stem; (d) there are at least three bases between the anticodon stem and the tuc stem. The program looks first for cloverleaf structure and then, if required, for conserved bases.
The sizes of the loops, the number of basepairs in the stems and the required conserved bases may all be specified by the user. The process of looking for the presence of conserved bases can reduce the number of potential structures found considerably. The user may also specify that an intron may be present in the anticodon loop. .para The user may define a minimum number of base pairs for each stem using the scoring system G-C, A-T=2 and G-T=1 and scores for each of the conserved bases. Recommended values for the stem scores are given by the prompts and the percentage conservation of the conserved bases as found in the Nucl. Acid Res 1979 paper Gauss, Gruter and Sprinzl are also given, but the user must decide which bases are most likely to be conserved for the sequence being examined. The output shows the position of the possible gene in the sequence by a vertical line the height of which shows the number of basepairs made in the stems. The cloverleaf structure is also drawn but will scroll up off the screen. Output of the cloverleafs will look like: .lit 6942 A A-U A-U G-C A-U U-A A-U U-A AAU U UAUCU AA A !!!!! AAUG AUAGA A U !!!! U UCA C UUAC U AA A U-AA A A-U A-U C-G U-A U A U A GUC Typical dialogue follows. ? Menu or option number=D49 tRNA search ? Maximum trna length (70-130) (92) = ? Aminoacyl stem score (0-14) (11) = ? Tu stem score (0-10) (8) = ? Anticodon stem score (0-10) (8) = ? D stem score (0-8) (3) = ? Minimum base pairing total (30-32) (32) = ? Minimum intron length (0-30) (0) = ? Minimum length for TU loop (4-12) (6) = ? Maximum length for TU loop (6-12) (9) = ? (y/n) (y) Skip search for conserved bases n Give a score for each base, then a minimum total at the end ? Base 8, T is 100% conserved. Score (0-100) (0) = ? Base 10, G is 95% conserved. Score (0-100) (0) = ? Base 11, Y is 96% conserved. Score (0-100) (0) = ? Base 14, A is 100% conserved. Score (0-100) (0) = ? Base 15, R is 100% conserved. Score (0-100) (0) = ? Base 21, A is 97% conserved. 
Score (0-100) (0) = ? Base 32, Y is 100% conserved. Score (0-100) (0) = ? Base 33, T is 98% conserved. Score (0-100) (0) = ? Base 37, A is 91% conserved. Score (0-100) (0) = ? Base 48, Y is 100% conserved. Score (0-100) (0) = ? Base 53, G is 100% conserved. Score (0-100) (0) = ? Base 54, T is 95% conserved. Score (0-100) (0) = ? Base 55, T is 97% conserved. Score (0-100) (0) = ? Base 56, C is 100% conserved. Score (0-100) (0) = ? Base 57, R is 100% conserved. Score (0-100) (0) = ? Base 58, A is 100% conserved. Score (0-100) (0) = ? Base 60, Y is 92% conserved. Score (0-100) (0) = ? Base 61, C is 100% conserved. Score (0-100) (0) = ? Minimum total conserved base score (0-0) (0) = ? (y/n) (y) Plot results n Searching 306 C C-G C-G G-C T-A C-G A-T T+G AT A ATACA TTC T !!!! G CTGT TATGG G G ! ! T GA C TAAA C GCG C G T+GA C C-G C T T+G A T T-A G T T-A G A G G G C A A G A AGC T C A T C T A C T .end lit .left margIN1 @50. TX 7 @ Plot start codons .left margin2 .para This function plots the positions of all start codons for each of the three reading frames. .left margin1 @51. TX 7 @ Plot stop codons .left margin2 .para This function plots the positions of all stop codons for each of the three reading frames. .left margIN1 @52. TX 7 @ Plot stop codons on the complementary strand .left margin2 .para This function plots the positions of all stop codons for each of the three reading frames on the complementary strand. .left margin1 @53. TX 7 @ Plot stop codons on both strands .left margin2 .para This function plots the positions of all stop codons for each of the three reading frames on both strands. .left margin1 @54. TX 5 @ Search for longest open reading frames .left margin2 .para This function will report the positions of the ends of all sections of sequence that contain no stop codons. All six reading frames are examined. Results are presented in the form of an EMBL feature table.
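.para
A search for stretches free of stop codons can be sketched in Python as follows for the three frames of one strand. The function name, the use of the standard stop codons TAA, TAG and TGA, and the convention that the reported end is the last base before the stop are assumptions made for this illustration; the program's own conventions may differ. The other strand could be handled by passing in the reverse complement.
.lit
```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frames(seq, min_codons=30):
    """Return (start, end, frame) for stop-free stretches, 1-based inclusive."""
    results = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    results.append((start + 1, i, frame + 1))
                start = i + 3  # next open stretch begins after the stop
        # stretch running to the end of the sequence (whole codons only)
        end = start + ((len(seq) - start) // 3) * 3
        if (end - start) // 3 >= min_codons:
            results.append((start + 1, end, frame + 1))
    return results


frames = open_frames("ATGAAACCCTAAGGG", min_codons=3)
```
.end lit
In this toy sequence frame 1 is open from base 1 to base 9, immediately before the TAA stop.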
Because the results are in feature table format, a file of them created by use of "direct output to disk" can be used to translate the open reading frames in a sequence. Note that in order for the file to be used as a feature table it must include either EMBL or GenBank headers, and a suitable "tail". The simplest header is the word FEATURES starting in column 1 of the first line of the file. The simplest tail is 2 empty lines at the end of the file. These lines are not included when nip writes out results in feature table format. .para Define the minimum length of open reading frame to report (in amino acids). Choose to search either or both strands. The program displays the end points, the reading frame and strand. .para Typical dialogue follows. .lit ? Menu or option number=D54 Find open reading frames ? Minimum open frame in amino acids (5-1000) (30) =100 X 1 + strand only 2 - strand only 3 Both strands ? 0,1,2,3 =3 FT CDS 1 831 1 831 FT CDS 1540 2853 1 1314 FT CDS 3130 4242 1 1113 FT CDS 5761 6114 1 354 FT CDS 6187 6711 1 525 FT CDS 1766 2077 2 312 FT CDS 2078 2446 2 369 FT CDS 4136 5500 2 1365 FT CDS 1335 1637 3 303 FT CDS 2844 3194 3 351 FT CDS 6819 7238 3 420 FT CDS 2073 1711 C 1 363 FT CDS 2469 2149 C 1 321 FT CDS 6542 6144 C 3 399 .end lit .left margin1 @55. TX 8 @ Search for E. coli promoter (general) .LEFT MARGIN2 .para Searches for E. coli promoter like sequences using a standard weight matrix. The positions of the matches are plotted. No dialogue is required. .para The method was first described in Staden R. Nucl. Acids Res. 12 505-519 1984. This search uses a weight matrix taken from the frequency tables contained in Hawley, D. K. and McClure, R., Nucl. Acids Res. 11 2237-2255 (1983). The weight matrix is divided into 3 sections that are separated by varying sizes of gap: the -35 region, the -10 and the +1 region.
The algorithm first looks for a sufficiently good -35 region, then for the best -10 region within range and then for the best +1 region within range of the -10; each separate region must score above the lowest known score for the corresponding section. The gap penalty is then applied and two plots produced: one with gap penalties, one without. Scaling is such that no known promoter scores below the bottom level and no known promoter scores above the top level when the weight matrix is applied. .para Two other functions also look for E. coli promoters: 56 looks for sites on the complementary strand and 57 looks for individual -35 and -10 regions and plots them on a scale such that the top is the highest known value +10% and the bottom is the lowest known -10%. .LEFT MARGIN1 .lit weights for E. coli promoters -35 region: P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26 107109109110110110110110110111111110111112112112112112112112112112112112112 T 41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34 C 22 27 18 29 20 14 20 12 22 23 16 25 10 43 7 6 11 18 60 8 25 23 23 17 20 A 28 38 30 37 35 56 42 42 37 42 39 18 25 26 2 6 2 72 26 50 26 34 25 26 31 G 16 11 29 19 21 18 13 21 9 19 24 26 29 29 11 6 88 3 11 17 15 21 26 21 27 -10 region: P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5 112112112112112112112112112112112112112112112112112112112 T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19108 31 29 21 C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30 A 20 39 33 33 39 23 29 16 23 19 2106 29 66 57 1 35 23 31 G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30 + region: P -2 -1 1 2 3 4 5 6 7 8 9 10 86 88 85 88 88 88 88 88 88 88 88 88 T 16 22 2 42 27 23 20 25 27 15 16 29 C 29 49 4 25 25 13 18 22 17 17 16 17 A 20 9 45 16 24 25 28 24 24 32 35 26 G 21 8 37 5 12 27 22 17 20 24 21 16 .end lit Notes: E.
coli promoters have been shown to contain 2 regions of conserved sequence located about 10 and 35 bases upstream of the transcription startsite. These are TATAAT and TTGACA with an allowed spacing of 15 to 21 bases between them. The spacing with maximum efficiency was 17 bases and all but 12 of the 112 sequences could be aligned with a separation of 17 +/- 1 bases. The standard promoter has spacings of 7 and 17 bases between the startsite and the -10 region, and the -10 and -35 regions, respectively. The spacing between the -10 region and the startsite is usually 6 or 7 bases but varies between 4 and 8 bases. There is an AT rich region of 8 to 10 bases upstream of the -35 region. Initiation with a purine is highly preferred, with G being used if A is not present. .lit Gap penalties: 15 0.02 (only exists as mutant) 16 0.2 17 1.0 18 0.2 19 0.05 (guess) 20 0.02 (guess) 21 0.01 (guess) .end lit .left margin1 @56. TX 8 @ Search for E. coli promoter (general) strand .LEFT MARGIN2 .para This function searches for E. coli promoters on the complementary strand of the sequence. See the notes on option 55. .left margin1 @57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10) .LEFT MARGIN .para This function searches separately for the -35 and -10 sequences of an E. coli promoter. See the notes on option 55. .left margIN1 @58. TX 8 @ Search for procaryotic ribosome binding sites .LEFT MARGIN2 .para This function searches for the 5' ends of prokaryotic genes using an unusual weight matrix. The search is relatively slow because the matrix is 101 bases in length. No dialogue is required. .para The method was first described in Staden Nucl. Acids Res. 12 505-519 1984. It actually looks for more than a ribosome binding site, as is explained below. It uses the weight matrix w101 of Stormo and Schneider (NAR 10 2971-3024, 1982) which with a value of 2 finds all gene starts in their library.
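.para
Applying a weight matrix of this kind amounts to summing, at every position along the sequence, the weight of the base found in each column and reporting the positions whose total reaches a threshold. The Python sketch below shows this with an invented three column matrix; the function name and the toy weights are for illustration only and are not taken from w101.
.lit
```python
def scan_weight_matrix(seq, matrix, cutoff):
    """Return (position, score) for each window scoring at least cutoff.

    matrix is a list with one dictionary per motif position,
    mapping each base to its weight. Positions are 1-based.
    """
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(column.get(base, 0) for column, base in zip(matrix, window))
        if score >= cutoff:
            hits.append((i + 1, score))
    return hits


# hypothetical weights favouring the motif ATG
toy_matrix = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = scan_weight_matrix("CCATGCC", toy_matrix, cutoff=2)
```
.end lit
Only the window beginning at base 3 reaches the threshold in this example.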
.LEFT MARGIN1 .lit P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36 T 5 1 -3 9-14 7 15 -5 3-16-17 4 18 5 -3 -1 2 4 5 -5 7 8 -5-15 6 C-21 -6-11-21 0 8 -7-12 -1 1 0-19 12 -3 -1 10 2 -8 -5-11 8 1 23 6 -5 A 7 -2 13 -2 -8-13-18 5 0 -5 13 8-15 9 -4 -7 9 0 -8-11-10 -6 -7 -5 -6 G -6 -9 -7 0 8-16 -4 -2-16 1 -4 8-14 5 11-13-24 3 7 22-11 -9-15 10 -4 P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11 T 3 4 16 -4 7 11 -4 -1 12 8 10 -1 1 8 2-10-16 11 1 -3 16 -3-36 -8-27 C 2-14 -3 -8-10-21 2 0 -2 -1-11 -3 -1 5-11 -4 7 0-14 6 -8-20 -7-36-44 A-12 -1-27 -3 -6 0-12 -3 -4 -7 14 -2 -4 -6 0 12 5 -9 0-11-11 10 8 2 8 G 4 -5 -6 -3 -1 -4 -1 -4-15 0-14 3 10-19 -3-10 -7 -7 7 1 -8 -6 15 21 42 P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 T-53-27-26-23 2 -7-14-40-28 0-53 75-62-20-40-10-35 -5-12 -1 4 14-23 7 -2 C-15-50-43-35-38-29-29 1 -9 1-87-55-64-45 11-22-14-20-15-15-10-22 -5 2 6 A 0 -3 -5 4-20-11 5 6 -2-15 66-69-52 -5 -4 6 8-24 -7-10 -7 13 14 -9-18 G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15 P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 T-26 1 4 -7 3 -4 0-10 8-18 7-22-21 8 4 -3 -6 7 -8 1 -5-16-16 7 -6 C 6 -8 19 -7 9 -3 17 -2 3 -9 5 22 22 8 -1 1 18 6 11-10 -8 7 10 0 7 A 14-12-42 1 -5 -4-32 12-10 20 -6 -1 3 -4 4-10 -1 -2-14 11 14 -3 2-13 5 G-23 -7 -1 -6-17 -4 0-15-14 -4-17-10 -5-13 -8 10-13-13 9 -4 -3 10 2 4 -8 P 40 T 0 C 14 A 5 G-21 .END LIT These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl. Acid Res. 10 2997- 3011, 1982. They report that this matrix gives a score of at least 2 for all gene starts in their library whereas all other sequences score 1 or less. .left margin1 @29. TX 1 @ Reverse and complement the sequence .LEFT MARGIN2 .para Reverses and complements the current active region of the sequence. .left margin1 @60. 
TX 7 @ Search using a dinucleotide weight matrix .LEFT MARGIN2 .para This function performs searches for short sequence motifs using an appropriate dinucleotide weight matrix. In addition it can be used to create or modify weight matrices. In order to perform a search the only input required is the name of the file containing the weight matrix. The results can be presented graphically or listed. The graphical presentation will draw a line at the position of any matches found; the height of the line is proportional to the score. The method is identical to that using weight matrices derived from nucleotide frequencies, except that here we use the frequencies of dinucleotides. .para For a search, select "use weight matrix", supply the name of the file containing the weight matrix, and choose between having results plotted or listed. If dialogue is requested when the function is selected users can alter the cutoff score employed. .para To create a weight matrix several steps are involved. A file containing an alignment of known motifs is required. (This file must be created before the current option is selected. The format is as follows: each sequence is written on a separate line with at least one space at the beginning; each sequence is terminated by a space character, and can be followed by a name. The sequences must be aligned.) Supply the name of the file of aligned sequences. The program reads and displays the sequences. Choose between "summing logs of weights" or summing weights (i.e. whether to multiply or add weights). If logs are used all scores will be negative. Choose if all positions in the set of aligned sequences should be used or if a mask should be applied. If so selected, define a mask as a string of symbols, in which symbol - means ignore and any other symbol means use. E.g. xx-x--abc means use all positions except 3,5 and 6.
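.para
The mask rule can be expressed in a few lines of Python. The function names below are invented for this illustration and are not part of the program.
.lit
```python
def mask_positions(mask):
    """Return the 1-based positions that a mask says to use."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]


def apply_mask(sequence, mask):
    """Keep only the bases at positions where the mask is not '-'."""
    return "".join(base for base, symbol in zip(sequence, mask) if symbol != "-")


used = mask_positions("xx-x--abc")
kept = apply_mask("AGCGTGACT", "xx-x--abc")
```
.end lit
For the mask xx-x--abc this gives positions 1, 2, 4, 7, 8 and 9, i.e. all positions except 3, 5 and 6.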
.para The program will calculate weights as the frequencies of the dinucleotides at each unmasked position in the set of aligned sequences. These weights are then applied to the set of aligned sequences to give a range of "observed" scores. The mean and standard deviation of these scores are displayed. The user is asked to supply several values to be used when the weight matrix is applied to other sequences: a cutoff score (by default, the mean minus 3 standard deviations), a top score for scaling graphical results (by default, the mean plus 3 standard deviations), and a position to identify (this means that if a particular base within the motif is used as a "landmark", such as the A of the AG in splice acceptor sites, then its position will be marked in plots). All these values are stored along with the weight matrix. Finally supply the name of a file to contain the weight matrix. .para Weight matrices can be "rescaled" using a set of aligned sequences in much the same way as a matrix is created. The purpose is to redefine the cutoff scores, and rescaling does not alter any other values in the weight matrix file. .para The methods have always had to deal with the problem of zeroes in the matrices. The current versions employ "Laplace's Law of Succession" in which 1 is added to each term. .lit Typical dialogue follows. ? Menu or option number=D60 Motif search using dinucleotide weight matrix X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 = 2 ?
Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ 1 AGCGTGACTCTTCCCGGAA HIS1 2 GAGGTGACTCACTTGGAAG HIS1 3 CGGATGACTCTTTTTTTTT HIS3 4 ACAGTGACTCACGTTTTTT HIS4 5 GTCGTGACTCATATGCTTT ARG3 6 TGAATGACTCACTTTTTGG ARG4 7 TTCTTGACTCGTCTTTTCT CPA1 8 CGAATGACTCTTATTGATG CPA2 9 AGAATGACTAATTTTACTA TRP5 10 TCGTTGACTCATTCTAATC TRP3 11 TTGCTGACTCATTACGATT TRP2 12 GAGATGACTCTTTTTCTTT IV1 13 GCGATGATTCATTTCTCTG IV2 14 TAGATGACTCAGTTTAGTC LEU1 15 TAAGTGACTCAGTTCTTTC LEU4 16 ATGATGACTCTTAAGCATG ILS1 Length of motif 18 ? (y/n) (y) Sum logs of weights n ? (y/n) (y) Use all motif positions n x means use, - means ignore e.g. xx-x---x-x means use positions 1,2,4,8,10 ? Mask=----XXXXXXXX-------- Applying weights to input sequences 1 89.000 AGCGTGACTCTTCCCGGA 2 91.000 GAGGTGACTCACTTGGAA 3 93.000 CGGATGACTCTTTTTTTT 4 90.000 ACAGTGACTCACGTTTTT 5 94.000 GTCGTGACTCATATGCTT 6 91.000 TGAATGACTCACTTTTTG 7 81.000 TTCTTGACTCGTCTTTTC 8 90.000 CGAATGACTCTTATTGAT 9 75.000 AGAATGACTAATTTTACT 10 97.000 TCGTTGACTCATTCTAAT 11 97.000 TTGCTGACTCATTACGAT 12 93.000 GAGATGACTCTTTTTCTT 13 69.000 GCGATGATTCATTTCTCT 14 90.000 TAGATGACTCAGTTTAGT 15 90.000 TAAGTGACTCAGTTCTTT 16 90.000 ATGATGACTCTTAAGCAT Top score 97.000 Bottom score 69.000 Mean 88.750 Standard deviation 7.319 Mean minus 3.sd 66.794 Mean plus 3.sd 110.706 ? Cutoff score (-999.00-9999.00) (66.79) = ? Top score for scaling plots (66.79-999.00) (110.71) = ? Position to identify (0-18) (1) = ? Title=GCN4 DI WTS ? Name for new weight matrix file=3.WTS ? Menu or option number=D60 Motif search using dinucleotide weight matrix X 1 Use weight matrix 2 Make weight matrix 3 Rescale weight matrix ? 0,1,2,3 = ? Motif weight matrix file=3.WTS GCN4 DI WTS ? Cutoff score (-9999.00-9999.00) (66.79) =40 ? 
(y/n) (y) Plot results n 15 42.00 CAACCCGCTCACCGACAA 29 42.00 ACAACAGCTCACCCACGC 93 46.00 AGCCTTCCTCATCGCTGC 153 40.00 CAGCGGAATCAAACTTAA 408 42.00 CGATGGATTCAAGTTGAA 469 47.00 TTAGGAACTCCCTCTGTC 493 60.00 AAGCTGAATCTTAGCAGC 530 43.00 CGGAGGGCTCAGTGAGGG 542 47.00 TGAGGGACTACTGCACCA 678 41.00 CTTCTGCTTCAAAGAGTT 709 47.00 AATATGACGGCGCACGTG 848 54.00 GTCAGAACTCAAATCAGT 940 49.00 CCGTTGACGACCTCCGCA 992 42.00 TGGGCACCTCACACCAAG .end lit .left margIN1 @61. TX 8 @ Search for eukaryotic ribosome binding sites .LEFT MARGIN2 .para Searches for eukaryotic ribosome binding sites using weightings derived from Sargan,Gregory,Butterworth febs let 147 133-136 1982. No dialogue is required. First described in Staden Nucl. Acid Res. 12 505-519 1984. .LEFT MARGIN1 .lit mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET 147 133-136 1982 P -7 -6 -5 -4 -3 -2 -1 1 2 3 102102102102102102102102102102 T 19 24 31 12 0 18 5 0102 0 C 20 15 32 65 5 42 52 0 0 0 A 50 27 27 19 86 36 34102 0 0 G 6 29 12 6 11 6 11 0 0102 VIRAL ONLY P -7 -6 -5 -4 -3 -2 -1 1 2 3 41 41 41 41 41 41 41 41 41 41 T 14 12 16 4 2 13 9 0 41 0 C 7 3 13 17 7 9 14 0 0 0 A 15 10 6 10 27 15 9 41 0 0 G 5 16 6 10 5 4 9 0 0 41 .END LIT The Sargan et al paper puts forward the hypothesis that there is an interaction between some mRNA leader sequences and a highly conserved structure in the 18S rRNA of eukaryotic ribosomes. The attempt to substantiate the hypothesis includes a table of base frequencies for sequences immediately 5' to start codons. They examined 102 sequences and I have used the base frequencies they found as a weight matrix for searching for eukaryotic gene starts. I don't yet know how good this method is. The viral sequences were found to be slightly different but the separate table shown here is not used in the program. .left margin1 @62. TX 8 @ Search for splice junctions .LEFT MARGIN2 .para Used to search for mRNA splice junctions using a weight matrix. 
The default weight matrix is still that derived from the paper of Mount (Nucl. Acids Res. 10, 459-472). However users may employ their own tables. By default the positions of possible junctions will be plotted rather than listed. The diagram splits the donor plot into 3 horizontal boxes so that all the sites marked in any box are from the same reading frame. The acceptor plot appears above the donor plot and is split in an equivalent way. So sites marked as donors and acceptors in equivalent boxes are compatible. i.e. donors from donor box 1 are compatible with acceptors from acceptor box 1, etc. Of course it is the combination of reading frame and splice sites that really matters, and donors from box 1 can be compatible with acceptors in box 3 if the reading frame switches. .para If dialogue is selected users can employ their own file of weights (see below for the format), can change the cutoff scores, and can elect to have the results listed rather than plotted. Listed results show the position (of the last or first base in the exon), the frame and the matching sequence. The frequency table shown below is used as a default weight matrix and AG and GT are obligatory at the appropriate positions. The plots are scaled so that the top of scale is the highest value achieved by a junction sequence in the set used to compile the frequency table, and the bottom of the scale is the lowest value achieved by a junction sequence in the set used to compile the frequency table. .para In the light of current knowledge it would be sensible for users to use the weight matrix search option (20) to create matrices that define more specific splice junctions. If so it is important that the positions "marked" are the last base in the donor exon and the first base in the acceptor exon. To make a weight matrix suitable for use with this function follow the instructions for option 20 and create files for both donor and acceptor sites. 
Then concatenate the two matrix files with the donor file first. Note that any positions in the weight matrix that are 100% conserved will be made obligatory (normally the AG and GT).
.LEFT MARGIN1
.lit
Mount donors redone 16-4-91
  12   3  -16.085   -7.500
 P  -2  -1   0   1   2   3   4   5   6   7   8   9
 N 136 136 136 136 136 136 136 136 136 136 136 136
 T  28   8  15  17   0 136   9  16   7  84  30  36
 C  41  60  16   7   0   0   3  13   3  17  28  39
 A  40  56  89  12   0   0  83  91  12  23  53  33
 G  27  12  16 100 136   0  41  16 114  12  25  28
Mount acceptors redone 16-4-91
  18  15  -26.142  -14.400
 P -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
 N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
 T  58  50  57  59  67  56  58  49  47  66  64  31  34   0   0  11  41  31
 C  21  28  34  25  29  33  35  32  42  40  33  25  74   0   0  23  28  41
 A  17  11  11  18   7  17  12  23  15   3  10  29   5 113   0  24  21  21
 G  17  24  11  11  10   7   8   9   9   4   6  28   0   0 113  55  23  20
.END LIT
.left margIN1
@63. TX 7 @ Search using a weight matrix (complementary)
.LEFT MARGIN2
.para
This function searches the complementary strand of the sequence using a weight matrix. Many motifs can bind to either strand of the DNA and this function allows users to search the complementary strand without having to change the orientation of the sequence. See option 20 for more details.
.left margin1
@64. TX 3 @ Plot observed-expected word frequencies
.LEFT MARGIN2
.PARA
This option is designed to examine the abundances of short words in a sequence to see if particular ones are either under or over represented. It compares the observed and expected frequencies and plots them along the sequence. There has been some work on the relative amounts of CG dinucleotides in eukaryotic sequences (e.g. Bird, Nature 321, 209-213 (1986)) and this new routine can be used to examine such biases, or any others that might be interesting.
.para
The user selects a word (say CG), a window length, and maximum and minimum scale values for plotting the results.
The program examines each successive window along the sequence, with each window overlapping the previous one by windowlength-1. The program counts the base frequencies in each window, and the number of occurrences of the chosen word within the window. Using the base frequencies it calculates an expected number of occurrences for the chosen word (simply by multiplying the relevant frequencies). It plots observed-expected, and hence will show regions that are rich or depleted in the chosen word. The longest allowed word is 9 characters, but the calculation of the expected frequencies becomes less appropriate as the word length increases above 2.
.para
Typical dialogue follows.
.lit
 ? Menu or option number=D64
 Plot composition differences (obs-exp))
 Default String=CG
 ? String=
 ? odd span length (3-401) (101) =
 ? plot interval (1-20) (5) =
 ? Maximum plot value (-6.31-25.25) (6.31) =
 ? Minimum plot value (-25.25-6.31) (-6.31) =

 Missing graphics display here
.end lit
.left margIN1
@65. TX 9 @ Search for polya sites
.LEFT MARGIN2
.para
Simply searches for the sequence AATAAA (Proudfoot and Brownlee, Nature 263, 211-214, 1976) and marks it with a short vertical line.
.left margin1
@66. TX 1 @ Interconvert t and u
.LEFT MARGIN2
.para
This function interconverts T and U characters in the active sequence, i.e. between DNA and RNA.
.LEFT MARGIN1
@67. TX 7 @ Search for patterns of motifs
.left margin2
.para
This option searches for patterns of motifs. Patterns can be defined interactively or read from files. Results can be displayed in several ways in both graphical and textual form. The option can also be used to create pattern files for searching libraries. It is extremely flexible and consequently the following documentation is quite lengthy. However the routine is capable of searching for almost any known pattern. In addition the flexibility does not necessitate difficulty of use, and the user interface has been simplified considerably since the methods were first published.
.para
Users should refer to the "typical dialogue" shown below for the most helpful information on using the program.
.para
There are currently four ways to display the matching patterns: 1 = each individual motif and its position is listed; 2 = all the sequence between, and including, the two outermost motifs is listed; 3 = graphical, with a vertical line marking the position of the leftmost motif; 4 = EMBL feature table format, where the KEYNAM field is the motif name, the FROM and TO fields denote the ends of the match, and the DESCRIPTION field is "Program".
.para
When it is defined for the first time a pattern must be entered interactively at the keyboard, but the pattern description can be saved to a file. This file can be used for all subsequent searches.
.para
When defining a pattern interactively, select a motif class and the program will request the required inputs.
.para
The program gives each motif an identifying name and number. For motifs other than the first, a range of allowed positions must be defined. (Note that sets of motifs included using the OR operator will all be given the same range, and so the program will only request range values for the first motif in any such set.) To specify the allowed range for a motif the user must supply the following: the identifying number of the motif relative to which the current motif's positions are to be defined (termed the "reference motif"); a "relative start position"; and a range. The relative start position can be negative or positive. A negative start position means that although the reference motif is searched for first, the current motif can be found to its left. A zero relative start position means their left ends are superimposed. The default start position is to butt-joint the motif to the right-hand end of the "reference motif". The range is "the number of extra positions" that the motif can take.
.para
The program will display the probability of finding each motif.
These values are presented in the following form: .1234E-5 means 0.1234 times 10 to the power -5. .para After the pattern has been defined, the program will type a description of it on the screen. It will then allow the user to give an overall cutoff score and overall probability cutoff. .para Typical dialogue for all the different motif classes is displayed below. .lit ? Menu or option number=67 Pattern searcher ? (y/n) (y) Read pattern from keyboard X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 = ? Motif name=Ematch ? String=AA Probability of score 2.0000 = 0.595E-01 X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =2 ? Motif name=AAA X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-1) (1) = ? Relative start position (-1000-1000) (3) = ? Number of extra positions (0-1000) (0) = ? string=AAA ? Minimum matches (1.00-3.00) (3.00) =2 Probability of score 2.0000 = 0.149E+00 1 Exact match X 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =3 ? Motif name=T'S X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-2) (2) = ? Relative start position (-1000-1000) (4) = ? Number of extra positions (0-1000) (0) = ? String=TTT ? 
Minimum score (0.00-108.00) (108.00) =72 Probability of score 72.0000 = 0.258E+00 1 Exact match 2 Percentage match X 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =4 ? Motif name=GCN4 X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-3) (3) = ? Relative start position (-1000-1000) (4) = ? Number of extra positions (0-1000) (0) = ? Weight matrix file name=GCN4 GCN4 FROM WEIGHTS 17-11-87 Probability of score -22.0020 = 0.139E-02 1 Exact match 2 Percentage match 3 Cut-off score and score matrix X 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =5 ? Motif name=GCN4 X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-4) (4) = ? Relative start position (-1000-1000) (20) = ? Number of extra positions (0-1000) (0) = ? Weight matrix file name=GCN4 GCN4 FROM WEIGHTS 17-11-87 Probability of score -22.0020 = 0.606E-03 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix X 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =6 ? Motif name=LOOP X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-5) (5) = ? Relative start position (-1000-1000) (20) = ? Number of extra positions (0-1000) (0) = ? Stem length (1-60) (6) = ? Minimum loop length (-6-60) (0) = ? Maximum loop length (0-60) (0) =5 ? Minimum score (1.00-12.00) (12.00) =10 Probability of score 10.0000 = 0.598E-02 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix X 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 
0,1,2,3,4,5,6,7,8,9 =7 ? Motif name=Tstep X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-6) (6) = ? (y/n) (y) Relative to 5 prime end ? Relative start position (-1000-1000) (1) = ? Number of extra positions (0-1000) (0) = ? String=TTT ? Step (1-20) (3) = Probability of score 3.0000 = 0.367E-01 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop X 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =8 ? Motif name=REPEAT X 1 And 2 Or 3 Not ? 0,1,2,3 = ? Number of reference motif (1-7) (7) = ? Relative start position (-1000-1000) (4) = ? Number of extra positions (0-1000) (0) =2 ? Repeat length (1-60) (6) = ? Minimum gap (0-60) (0) = ? Maximum gap (0-60) (0) =4 ? Minimum score (1.00-6.00) (6.00) =5 Probability of score 5.0000 = 0.554E-02 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step X 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =9 ? (y/n) (y) Save pattern in a file N Pattern description Motif 1 named Ematch is of class 1 Which is an exact match to the string AA Motif 2 named AAA is of class 2 which is a match of score 2. to the string AAA and the 5 prime base can take positions 3 to 3 relative to the 5 prime end of motif 1 It is anded with the previous motif. Motif 3 named T'S is of class 3 which is a match of score 72. to the string TTT and the 5 prime base can take positions 4 to 4 relative to the 5 prime end of motif 2 It is anded with the previous motif. Motif 4 named GCN4 is of class 4 Which is a match to a weight matrix with score -22.002 and the 5 prime base can take positions 4 to 4 relative to the 5 prime end of motif 3 It is anded with the previous motif. 
Motif 5 named GCN4 is of class 5 Which is a match to the complement of a weight matrix with score -22.002 and the 5 prime base can take positions 20 to 20 relative to the 5 prime end of motif 4 It is anded with the previous motif. Motif 6 named LOOP is of class 6 Which is a stem-loop structure with stem length 6 and score 10. The loop can have sizes 0 to 5 and the 5 prime base can take positions 20 to 20 relative to the 5 prime end of motif 5 It is anded with the previous motif. Motif 7 named Tstep is of class 7 Which is an exact match to the string TTT with a step size of 3 and the 5 prime base can take positions 1 to 1 relative to the 5 prime end of motif 6 It is anded with the previous motif. Motif 8 named REPEAT is of class 8 Which is a repeat with repeat length 6 and score 5. The loop-out can have sizes 0 to 4 and the 5 prime base can take positions 4 to 6 relative to the 5 prime end of motif 7 It is anded with the previous motif. Probability of finding pattern = 0.2348E-14 Expected number of matches = 0.5100E-09 ? Maximum pattern probability (0.00-1.00) (1.00) = ? Minimum pattern score (-9999.00-9999.00) (-9999.00) = Select display mode X 1 Motif by motif 2 Inclusive 3 Graphical 4 EMBL feature table ? 0,1,2,3,4 =4 Searching Total matches found 0 Menus and their numbers are m0 = This menu m1 = General m2 = Screen control m3 = Statistical analysis of content m4 = Structures and repeats m5 = Translation and codons m6 = Gene search by content m7 = Prokaryotic signal search m8 = Eukaryotic signal search ? = Help ! = Quit ? Menu or option number=67 Pattern searcher ? (y/n) (y) Read pattern from keyboard X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 = ? Motif name=Arun ? 
String=AAAAAA Probability of score 6.0000 = 0.210E-03 X 1 Exact match 2 Percentage match 3 Cut-off score and score matrix 4 Cut-off score and weight matrix 5 Complement of weight matrix 6 Inverted repeat or stem-loop 7 Exact match, defined step 8 Direct repeat 9 Pattern complete ? 0,1,2,3,4,5,6,7,8,9 =9 ? (y/n) (y) Save pattern in a file N Pattern description Motif 1 named Arun is of class 1 Which is an exact match to the string AAAAAA Probability of finding pattern = 0.2103E-03 Expected number of matches = 0.1522E+01 ? Maximum pattern probability (0.00-1.00) (1.00) = ? Minimum pattern score (-9999.00-9999.00) (-9999.00) = Select display mode X 1 Motif by motif 2 Inclusive 3 Graphical 4 EMBL feature table ? 0,1,2,3,4 =4 Searching FT Arun 1582 1587 Program FT Arun 3160 3165 Program FT Arun 4204 4209 Program FT Arun 5691 5696 Program FT Arun 6710 6715 Program Total matches found 5 Minimum and maximum observed scores 6.00 6.00 .end lit .para These methods allow users to define and search for complex patterns of motifs defined as single objects. The programs allow individual DNA motifs to be defined in eight different ways, and protein motifs in six. Motifs are combined, using the logical operators AND, OR and NOT, to describe a pattern. The pattern also specifies the ranges of allowed relative separations of the individual motifs. .para First some definitions. .para A MOTIF is a contiguous subsequence of fixed length. At its simplest it could be a single definite base or amino acid; a more complex motif might be better represented as a consensus or a weight matrix; two more-abstract types of motif are direct and inverted repeats. .para A PATTERN is a higher order of structure defined by a list of motifs. The motifs in a pattern are combined using the logical operators AND, OR and NOT. The list also defines the allowed relative separations of the motifs. In the current versions of the programs up to 50 motifs can be combined into a single pattern. 
So using these definitions there are two differences between motifs and patterns: 1) the distances between all elements of a motif are fixed, but the separations of parts of patterns can vary; 2) all characters in a motif are defined using the same method (class), but different parts of a pattern can be defined in completely different ways. .para Each motif can be represented in 9 ways (known as the motif class): .sk1 .lit MOTIF CLASSES CLASS DESCRIPTION 1 Exact match to a short defined sequence. The IUB symbols can be used for DNA sequences. 2 Percentage match to a defined short sequence. In nucleic acids, the IUB symbols can be used. 3 Match to a defined sequence, using a score matrix and cutoff score. The DNA matrix (see option 18) gives scores to IUB symbols depending on their level of redundancy. MDM78 is used for proteins. 4 Match to a weight matrix with cutoff score. 5 As class 4 but on the complementary strand. 6 Inverted repeat or stem-loop. Fixed stem length, range of loop sizes, and cutoff score using A-T, G-C=2; G-T=1. 7 Exact match to short sequence but with a defined step size. 8 Direct repeat. Fixed repeat length, range of loop-out sizes, cutoff score, and score matrix (for protein sequences MDM78 and for nucleic acids an identity matrix). 9 Membership of a set. A list of sets of allowed amino acids for each position in the motif. The sets are separated by commas(,). For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino acids in which one of I,V or L must be found in the first position, then anything in the next two positions, D,E,K or R in the fourth position and F,Y,W,I,L,V or M in the fifth. This class only applies to protein sequences because for nucleic acids "membership of a set" can be achieved using IUB symbols. Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to nucleic acids. .end lit .para Class 1: exact match. .para The motif is defined by a short sequence, which for nucleic acids, may include IUB symbols. 
All symbols must match.
.para
Class 2: percentage match
.para
The motif is defined by a short sequence, which for nucleic acids, may include IUB symbols. The minimum number of matching characters must also be specified.
.para
Class 3: match using a score matrix
.para
The motif is defined by a short sequence, which for nucleic acids, may include IUB symbols. The motif is not compared directly with the sequence to count the number of matching characters. Instead a matrix is used to provide a score for all possible pairs of characters. The motif score for any position along the sequence is the sum of the scores found by looking up the score for each pair of aligned characters. A match is declared if some minimum score is achieved.
.para
Class 4: weight matrix
.para
The motif is defined by a table of values (called weights or scores). The table gives a score for finding each possible character at each position along the length of the motif. It therefore has dimension motif-length x character-set-size, and allows us to give different scores for each character at each position. It is equivalent to having a different score matrix for each position along the motif, and provides the most flexible and specific method of defining motifs. The weight matrices are created by program NIP option 20 and stored as files. The file contains the values for each position, as well as an overall minimum score. There are two ways in which these values can be used to calculate an overall score for any section of the sequence. The simplest way is to add the values in the file. (This means that the highest possible score can be calculated by adding the top value at each column position, and the lowest by adding the bottom value.) The normal way of using the values in the file is as follows. First the programs divide the values in each column by the column total so that they sum to 1.0. Then the natural logs of these values are used as scores.
When the matrix is applied to a sequence these logarithmic values are summed (which is of course equivalent to multiplying the frequencies). Note that using the natural logs of the frequencies as weights and adding them means that the overall cutoff score must be less than zero, whereas if the original values in the weight matrix file are added, the cutoff score will be greater than zero. The search routines therefore decide whether the user wants to add values or multiply frequencies by examining the value of the cutoff score: it will add if the cutoff is greater than zero and add logs of frequencies if it is less than zero. Hence we effectively get two motif classes in one. The program NIP, when creating weight matrix files, will ask the user whether the scores should be added or multiplied. If the values in the table have been defined without using a set of aligned sequences it is easier for the user to choose a cutoff score if the values are added. .para Class 5: complement of weight matrix .para The motif is defined by a weight matrix, but the program searches for its complement. .para Class 6: inverted repeat, or stem-loop .para The motif is defined by a repeat length, a minimum score and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0. The loop sizes are defined by a minimum and maximum distance from the 3' end of the stem. For a stem-loop these will be positive numbers. For example to define a stem of length 8 and loop sizes varying from 3 to 5, the stem would be set to 8, the minimum start distance to 3 and the maximum to 5. To define an inverted repeat the minimum distance will be negative. For example stem length=9, minimum distance=-9, and maximum distance=-8 will find inverted repeats of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base at its centre, the second having none. .para Class 7: exact match, defined step size. 
.para The motif is defined by a short sequence, which for nucleic acids, may include IUB symbols. All symbols must match. The class differs from class 1 in that searches will move in steps of some given size. For example we could search for a certain codon and use a step size of 3 and hence keep in a single reading frame. .para Class 8: direct repeat .para The motif is defined by a repeat length, a minimum score and a range of loop sizes. The scores are defined using MDM78 for protein sequences and an identity matrix for nucleic acids. The loop sizes are defined by a minimum and maximum distance from the 3' end of the stem. .para Class 9: membership of a set .para This motif class is for protein sequences. It is defined by lists of allowed amino acids for each position in the motif, and a cut-off score. Positions at which any amino acid can occur are left blank. All allowed amino acids for each position give a score of 1. The motifs can be defined in two ways: either typed at the keyboard or read in as a weight-matrix-like file. When the motif is defined at the keyboard the sets of allowed amino acids are separated by commas(,). For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino acids in which one of I,V or L must be found in the first position, then anything in the next two positions, D,E,K or R in the fourth position and F,Y,W,I,L,V or M in the fifth. To specify that the whole motif must match a score of 3 would be required (i.e. one of the allowed amino acids must be found for each of the three defined positions). If the motif is read from a file the file must have been written by program NIP, or have been saved by the pattern searching routines. If the user elects to save a pattern, and it includes class 9 motifs typed at the keyboard, then the program will save the class 9 motifs as weight matrix files. Therefore it will request file names for each motif of this class. 
If the motif given above as an example were saved the weight matrix file would have 5 columns. The first column would contain zeroes except for the I, V and L rows which would be set to 1; the next two columns would all be zero; the next would be zero except for the D, E, K and R rows which would be 1; the final column would contain 1's in rows F, Y, W, I, L, V and M, with the rest zero.
.para
The logical operator (AND, OR or NOT) used to add each motif to the pattern is specified by preceding the class number by the letters A, O or N. A = AND, O = OR, N = NOT. The default is A, so N2 means include, using the NOT operator, a class 2 motif; O2 means include, using the OR operator, a class 2 motif; both A2 and 2 mean include, using the AND operator, a class 2 motif.
.para
Range setting.
.para
The motifs in a pattern are numbered according to their order in the list. Apart from the first motif in a pattern all motifs are given a range of allowed positions relative to a motif further up the list. For example suppose we have a pattern defined by A AND B AND C AND D. Motif A can occur anywhere, but B must have its range of allowed positions defined relative to the position of motif A, and C's positions can be defined relative to either A or B, depending on which is most convenient, and likewise D's positions can be relative to A or B or C.
.para
Notice that the positions of motifs can be defined relative to more than one motif. Suppose we have a pattern consisting of motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-10 residues right of B, and also C is never more than 15 residues from A. Then it is quite consistent with the methods to include motif C into the pattern twice using the AND operator: once relative to A and once relative to B. This will define the relative spacing and the ORDER of the motifs in the pattern. (If we simply defined the position of C relative to A it could be found to the left of B).
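.para
The range rules above can be illustrated with a short sketch. It is not the program's own code: the function names are hypothetical, positions are 0-based, a relative start of zero is taken to mean that the left ends are superimposed, "5-10 residues right" is read as an offset between start positions, and only class 1 (exact match) motifs without IUB symbols are handled. The example includes motif C twice, once relative to A and once relative to B, as described above.
.lit
```python
def allowed_starts(ref_start, rel_start, extra):
    """Start positions a motif may take, given the start of its reference motif.

    rel_start is the relative start position (may be negative) and extra is
    the "number of extra positions"; both conventions are assumptions here.
    """
    first = ref_start + rel_start
    return range(first, first + extra + 1)


def find_all(seq, word):
    """All 0-based start positions of exact matches to word (overlaps allowed)."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def search_abc(seq, a, b, c):
    """Pattern A AND B AND C AND C, with C ranged relative to both B and A."""
    b_pos = set(find_all(seq, b))
    c_pos = set(find_all(seq, c))
    hits = []
    for pa in find_all(seq, a):
        for pb in allowed_starts(pa, 5, 5):          # B starts 5-10 right of A
            if pb not in b_pos:
                continue
            for pc in allowed_starts(pb, 5, 5):      # C starts 5-10 right of B
                # second inclusion of C: no more than 15 from A
                if pc in c_pos and pc in allowed_starts(pa, 0, 15):
                    hits.append((pa, pb, pc))
    return hits
```
.end lit
.para
With these assumptions, a sequence in which B and C each satisfy their range relative to the preceding motif is still rejected if C starts more than 15 positions from A, which is the effect of including C twice.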
.para Motifs combined together using the OR operator are all given the same range. For example suppose we had a pattern A AND (B OR C) AND (D OR E), then B and C each have the same range, and D and E also have the same range as one another. The range for D and E can be relative to A or to B. .para Motifs cannot have their ranges defined relative to motifs that are included using the NOT operator. For example if we had the pattern A NOT B AND C, then the range for C can only be defined relative to motif A. .para Speed can be gained by arranging the order of the motifs so that those higher up the list are of types that can be searched for rapidly and that are also unlikely to be found. .para Motifs combined by the OR operator are alternatives: if any one of a set of motifs combined by the OR operator is found, then a match is declared. All alternatives will be reported. For example if we had a pattern defined by A AND (B OR C), then all places where A occurs and B is found within range, and all places where A is found and C is found within range will be reported. A typical use would be where we might allow a motif to appear on either strand of the DNA sequence. For example a weight matrix representing the heatshock element could be used in a pattern which included heatshock as a motif class 4 combined using the OR operator with heatshock as a motif class 5. .para The probability calculations are performed for each motif as it is defined. If an overall probability cut-off is given the calculation is repeated for each match found. To achieve maximum searching speed do not give an overall probability cut-off. Overall cut-off scores should only be used if the motif classes used are compatible. .para There are currently several ways to display the matches: 1 = each motif and its position is listed; 2 = all the sequence between the two outermost motifs is listed; 3 = graphical, with a spike marking the position of the leftmost motif. 
The library versions also give entry names, and a one line title; in addition they can be used to produce aligned families of sequences. When this mode of output is selected the program will write a separate file for each match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the name of the entry in the library. The matching sequence will be written out so that the spacing between motifs is constant, and set to the maximum allowed by the pattern definition. Any gaps will be filled with dashes (-). If the individual sequences were subsequently written one above the other they should line up so that all motifs are in register. There are two types of output of this sort: one, option 4, writes out whole sequences; the other, option 5, writes out only the sequences between the two outermost motifs. Note that for option 4 users are asked to type the position of the first motif, and the reason for this is explained below. Consider a pattern found in several sequences. Consider only the first motif in the pattern and suppose that it was found in different positions in these sequences. Say that of these positions the one furthest from the left end was position 100. Then, in order to ensure that all the sequences would align, we must specify that motif 1 must start at position 100. Any sequences in which motif 1 started nearer to the left end than position 100 would be padded accordingly. These modes of output should only be used when the position of each motif is defined relative to its immediate neighbour.
.para
The pattern descriptions can be saved to files. These files can be used instead of typing definitions again at the keyboard.
As the files are annotated, they can easily be changed using system editors, and the modified versions used to define the variant patterns for the programs. .para Use of lists of entry names .para The two programs that operate on libraries have the ability to restrict their searches to subsets of the libraries. This does not require sublibraries to be created but instead is achieved by using files containing a list of the entry names of sequences. The user may choose to search only those entries on the list or, alternatively to search all but those on the list (i.e. in the latter case the list contains the names of those to be excluded). The programs can search libraries that have indexes and those that do not. If a list of names for inclusion is used, then the search will be faster if the index is present. In all other circumstances the whole library will be read. The list must be in library order except when it is used to include entries, and an index is available. The list must contain each entry name on a separate line, with the name starting in column 1 of the line. ie there must be no spaces at the start of the line. The list of entry names can be produced by the keyword searches of nip, pip, etc as long as the listings produced have a space character separating the entry name from the entry description. This will depend on how well the library reformatting programs work. For example swissprot entry names tend to run into the beginning of the descriptions, but other libraries are generally OK. .para One use of the programs is to look for patterns that we already know about, but in new sequences. However it is hoped that they will also be useful for finding new motifs. For example several known control regions in nucleic acid sequences consist of particular direct or inverted repeats; the inclusion of direct and inverted repeats as motif classes makes it possible to find previously unknown motifs of these types. 
Using these new programs we can ask questions like: "are there any inverted or direct repeats near to sections of sequence that contain both a CCAAT box and a TATA box?", and then search for such things throughout the libraries. In addition, the mode of output in which all the sequence between the two outermost motifs found is printed out allows us to extract sequences and examine them in more detail for further common subsequences. For example we might want to collect together all the sequences between putative CCAAT and TATA boxes. .para A further use of the inverted repeat motif class is the following. If a regulatory sequence in DNA is poorly defined but also an inverted repeat, then it might be an advantage to specify it both as a consensus sequence and a superimposed inverted repeat. In this way two weak definitions can be combined to produce a stronger pattern. .para Given only a few examples of a motif it should be possible to perform initial searches using a class 3 motif, and then, using plausible matching sequences, create a more specific weight matrix for the same motif. .para If motifs are combined with the first motif using the OR operator they will be ignored until all permutations that include the first motif have been looked for. The whole search will then be repeated, in turn, for each of those motifs that are combined with the first motif using the OR operator. An interesting consequence of this is that the program can be used, without change, to compare any newly determined sequence with all known individual motifs. We achieve this by having a pattern in which all known relevant motifs are combined using the OR operator. If we ask to use this pattern with a sequence, the program will automatically compare each individual motif in the pattern with the whole length of the sequence. As the number of known motifs grows this should become an increasingly useful standard procedure.
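.para The effect of combining motifs with the OR operator can be illustrated outside the program by a minimal Python sketch. It is not the program's own algorithm: the function names, the example motifs and the test sequence are hypothetical, and each motif is treated as a plain consensus string matched with an allowed number of mismatches, which is an assumption made only for illustration. Each motif is compared in turn with the whole length of the sequence, and every match is reported rather than only the best one.

```python
def find_motif(sequence, motif, max_mismatches=0):
    """Return the 1-based start positions at which motif matches sequence
    with no more than max_mismatches mismatching characters."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start + 1)
    return hits


def scan_or_pattern(sequence, motifs, max_mismatches=0):
    """Treat the motifs as alternatives joined by OR: search the whole
    sequence for each motif in turn and collect all matches per motif."""
    return {name: find_motif(sequence, motif, max_mismatches)
            for name, motif in motifs.items()}


# Hypothetical example data, chosen only to show the shape of the result.
motifs = {"CCAAT box": "CCAAT", "TATA box": "TATAAA"}
sequence = "GGCCAATCTGACGTTATAAAGGC"
results = scan_or_pattern(sequence, motifs)
for name, positions in results.items():
    print(name, positions)
```

.para With these example data the CCAAT motif is reported at position 3 and the TATA motif at position 15. Raising max_mismatches loosens every motif at once; a fuller sketch would carry a separate cutoff for each motif.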
.para The NOT operator is obviously useful for making sure particular motifs are not present, but it can also be used to bracket the levels of matches found. We may want a degree of match that lies between two limits - binding should occur, but not too strongly; or base-pairs should form, but not too many. We can specify this by asking for a match with a low score, in combination with a match with a high score, both for the same motif, but with the high score included using the NOT operator. .para The algorithm is designed to find all sections of a sequence that satisfy the pattern rather than only the best match. Particularly if some of the motifs in a pattern are less well defined than others, this can often result in the same region of a sequence being reported as having several matches that vary only in the positions of the weakest motifs. .para General remarks on motif searching .para Generally motifs are short subsequences that are thought to be associated with particular functions in some known sequences. Often we search for them to try to understand or interpret other sequences. Sometimes we search for motifs and patterns to test a hypothesis about their role: are they found in the expected positions in the expected sequences? In doing so we should remember that, in both proteins and nucleic acids, what we are really looking for is a particular three dimensional structure with certain affinities for other structures, and that we are assuming that the sequence of the motif alone defines the 3D structure we are searching for. The overall structure may be completely different to those in which the motif is functional, and hence the motif may have a different shape or be inaccessible. We should be aware of the importance of the context in which a motif is found. Where does it lie relative to the overall structure, is it accessible, is the three dimensional spacing between it and other motifs correct?
For example, is it on the same side of the double helix, and the correct distance from some other motif? How does context affect our assessment of the significance of finding a motif? Finding false mammalian mRNA splice junctions in non-coding sequences is far less important than finding false sites in pre-mRNA sequences, but finding them in the correct places is most important! In other words, it is often the case that when we are searching for a motif that is known to be necessary for some function, then a positive result in the form of a match in the required position, is more important than a high background of matches in the wrong positions. Being able to write down the probability of finding a motif in a random sequence tells us how well it is defined. In nucleic acids the DNA may contain many superimposed types of information such as those concerned with histone phasing, protein coding or mRNA secondary structure. These overlapping "codes" may interfere with one another causing matches to motifs to be poorer than expected. In general we will only have a limited number of examples of the motif and we do not know how representative they are. .para Sequences have superimposed functions: some parts may be of general structural importance and give rise to an overall framework, and other parts give specificity and hence are not common; we may want to use a set of aligned sequences to define a motif, but want to use only the framework positions. Alternatively we may want to pick out only those parts of a set of aligned sequences that give a particular property, and to ignore other similarities that are due to some other property and which could obscure the pattern we are interested in. It is possible to apply a mask to a set of aligned sequences in order to give weight to selected positions only. 
The ability to define a mask allows certain positions to be used in the motif and others to be ignored, and yet still permits the use of a set of aligned sequences to calculate weights. The mask is requested and applied by the program and results in the masked positions being zero in the weight matrix. The mask is defined in the following way. Suppose we had a motif of length 15, then the mask x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is the dashes (-) that are significant and that positions 1,4,7,8,10,11,12,13,14 and 15 will be non-zero). Of course the same set of sequences could be used with several alternative masks in order to extract different features and create corresponding weight matrices. .para The programs are described in Staden,R. CABIOS 4, 53-60, 1988; Staden,R. CABIOS 5, 89-96, 1989, and Methods in Enzymology 183, 193-211 (1990). .left margin1 @ end of help