@-1. TX 0 @General @-2. T 0 @Screen control @-2. X 0 @Screen @-3. TX 0 @Modification @0. TX -1 @SAP This is an interactive program whose primary use is for managing shotgun sequencing projects, but it can also be used for handling alignments of other sequences, including those of proteins. Currently the maximum 'gel reading' length is set to 4096 characters. Almost all of the information below describes the use of the program for shotgun projects, but those using the programs for handling other sequence alignments should interpret it accordingly. The data for such a project is stored in a special type of database. The program contains the tools that are required to type in gel readings, screen them against vector sequences and restriction sites; enter new gel readings into the database (automatically comparing and aligning them). In addition it contains editors and functions to examine the quality of the aligned sequences. There are three main menus: "general", "graphics" and "modification", and some functions have submenus. The general menu contains the following options: 0 = List of menus ? = Help ! = Quit 3 = Open a database 4 = Edit contig 5 = Display a contig 6 = List a text file 7 = Direct output to disk 8 = Calculate a consensus 17 = Screen against restriction enzymes 18 = Screen against vector 19 = Check consistency 25 = Show relationships 27 = set parameters 28 = Highlight disagreements 29 = Examine quality The graphics menu contains: 0 = List of menus ? = Help ! = Quit 10 = Clear graphics 11 = Clear text 12 = Draw ruler 13 = Use cross hair 14 = Change margins 15 = Label diagram 16 = Plot map 33 = Plot single contig 34 = Plot all contigs The modification menu contains: 0 = List of menus ? = Help ! 
= Quit 4 = Edit a contig 9 = Screen edit 20 = Auto assemble 21 = Enter new gel reading 22 = Join contigs 23 = Complement a contig 24 = Copy database 26 = Alter relationships 30 = Auto edit a contig 31 = Type in gel readings 32 = Extract gel readings The enter new gel reading menu contains: ? = Help ! = Quit 3 = Complete entry 4 = Edit contig... 5 = Display overlap 6 = Edit new gel reading... The join contig menu contains: ? = Help ! = Quit 3 = Complete join 4 = Edit left contig... 5 = Display joint 6 = Edit right contig... 7 = Move join The alter relationships menu contains: ? = Help ! = Quit 3 = Line change 4 = Edit single gel reading... 5 = Delete contig 6 = Shift 7 = Move gel reading 8 = Rename gel reading 9 = Break contig The edit menu contains: ? = Help ! = Quit 3 = Insert 4 = Delete 5 = Change Overview of the methodology The shotgun sequencing strategy In the shotgun sequencing procedure the sequence to be determined is randomly broken into fragments of about 400 nucleotides in length. These fragments are cloned and then selected randomly and their sequences determined. The relationship between any pair of fragments is not known beforehand but is found by comparing their sequences. If the sequence of one is found to be wholly or partially contained within that of another for sufficient length to distinguish an overlap from a repeat then those two fragments can be joined. The process of select, sequence and compare is continued until the whole of the DNA to be sequenced is in one continuous well determined piece. Definition of a contig A CONTIG is a set of gel readings that are related to one another by overlap of their sequences. All gel readings belong to a contig and each contig contains at least one gel reading. The gel readings in a contig can be summed to produce a continuous consensus sequence and the length of this sequence is the length of the contig. The rules used to perform this summation are given under "the consensus algorithm". 
At any stage of a sequencing project the data will comprise a number of contigs; when a project is complete there should be only one contig and its consensus will be the finished sequence. Note that since being introduced and defined as above the word "contig" has been taken up by those involved in genomic mapping. In that context the consensus with a precise length is not defined. Introduction to the computer method It is useful to consider the objectives of a sequencing project before outlining how we use the computer to help achieve them. The aim of a shotgun sequencing project is to produce an accurate consensus sequence from many overlapping gel readings. It is necessary to know, particularly at the latter stages of the project, how accurate the consensus sequence is. This enables us to know which regions of the sequence require further work and also to know when the project is finished. To show the quality of the consensus, the programs described here produce displays like that shown below. 

                      10        20        30        40        50
  -6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
    CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                      60        70        80        90       100
  -6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                         GGCACA*GTC
    CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                     110       120       130       140       150
  -6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                     GGCAGCGACCGTCG
  17 HINW.999                                        AGCGACCGTCG
    CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                     160       170       180       190       200
  -6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                           GTAGAGC*
    CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

This is an example showing the left end of a contig from position 1 to 200. Overlapping this region are gel readings numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in reverse orientation to their original reading (denoted by a minus sign). Each gel reading also has a name (eg HINW.010). It can be seen that in a number of places the sequences contain characters other than A,C,G and T. Some of these extra characters have been used by the sequencer to indicate regions of uncertainty in the initial interpretation of the gel reading, but the asterisks (*) have been inserted by the automatic assembly function in order to align the sequences. Underneath each 50 character block of gel reading sequences is the consensus derived from the sequences aligned above (the line labelled CONSENSUS). For most of its length the consensus has a definite nucleotide assignment but in a few positions there is insufficient agreement between the gel readings and so a dash (-) appears in the sequence. 
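
How a dash can arise in the CONSENSUS line may be easier to see from a small sketch. The following Python fragment is only an illustration under stated assumptions: it takes a simple vote in each column and writes a dash when the commonest symbol falls below a cutoff fraction. The function names, the 0.75 cutoff and the three short rows are invented for the example; the program's own rules are those given under "the consensus algorithm", and the sketch gives uncertainty codes no special treatment.

```python
from collections import Counter


def consensus_column(symbols, cutoff=0.75):
    """Return the commonest symbol in one aligned column, or '-' if agreement is too low."""
    symbol, count = Counter(symbols).most_common(1)[0]
    return symbol if count / len(symbols) >= cutoff else "-"


def consensus(rows, cutoff=0.75):
    """rows: equal-length strings; a space marks a column the gel reading does not cover."""
    out = []
    for column in zip(*rows):
        covering = [c for c in column if c != " "]
        out.append(consensus_column(covering, cutoff) if covering else "-")
    return "".join(out)


# Three hypothetical aligned gel readings; '*' is a padding character.
rows = [
    "GACGCC*GCT",
    "GACGCCCG*T",
    "    CC*GCT",
]
print(consensus(rows))  # GACGCC-G-T
```

In the two columns where only two of the three readings agree the vote falls below the assumed cutoff, so a dash is written, as in the display above.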
This display contains all the evidence needed to assess the quality of the consensus: the number of times the sequence has been determined on each strand of the DNA, and the individual nucleotide assignments given for each gel reading. So the aim is to produce the consensus sequence and, equally important, a display of the experimental results from which it was derived. In order to achieve this the following operations need to be performed: 1) Interpret autoradiographs and put individual gel readings into the computer. 2) Check each gel reading to make sure it is not simply part of one of the vectors used to clone the sequence. 3) Check each gel reading to make sure that those fragments that span the ligation point used prior to sonication are not assembled as single sequences. 4) Compare all the remaining gel readings with one another to assemble them to produce the consensus sequence. 5) Check the quality of the consensus and edit the sequences. 6) When all the consensus is sufficiently well determined, produce a copy of it for processing by other analysis programs. It is very unlikely that this procedure will only be passed through once. Usually steps 1 to 5 are cycled through repeatedly, with step 4 just adding new sequences to those already assembled. Generally step 6 is also used in order to analyse imperfect sequence to check if it is the one the project intended to sequence, or to look for interesting features. Analysis of the consensus, such as searches for protein coding regions, can also help to find errors in the sequence. The display of the overlapping gel readings shown above can be used to indicate, not only the poorly determined regions, but also which clones should be resequenced to resolve ambiguities, or those which can usefully be extended or sequenced in the reverse direction, to cover difficult regions. The original individual gel readings for a sequencing project are each stored in separate files. 
As the gel readings are entered into the computer (usually in batches, say 10 from a film), the file names they are given are stored in a further file, called a file of file names. Files of file names enable gel readings to be processed in batches. For each sequencing project we start a project database. This database has a structure specifically designed for dealing with shotgun sequence data. In order to arrive at the final consensus sequence many operations will be performed on the sequence data. Individual fragments must be sequenced and compared in both senses (i.e. both orientations) with all the other sequences. When an overlap between a new gel reading and a contig is found they must be aligned and the new gel reading added to the contig. If a new gel reading overlaps two contigs they must be aligned and joined. Before the two contigs are joined one of them may need to be turned around (reversed and complemented) so they are both in the same orientation. Clearly, keeping track of all these manipulations is quite complicated, and to be able to perform the operations quickly requires careful choice of data structure and algorithms. For these reasons it is not practicable to store the gel readings aligned as shown in the display above. Rather, it is more convenient to store the sequences unassembled, and to record sufficient information for programs to assemble them during processing. The data used to assemble the sequences is called relational information. The database comprises three files and they are described under the section entitled "open database". Before entry into the project database each new gel reading must be compared to look for overlaps with all the data already contained within the database. This last point is important: all searching for overlaps is between individual new gel readings and the data already in the database. 
There is no searching for overlaps between sequences within the database; overlaps must be found before new gel readings are entered into the database. Below I give an introduction to how the sequences are processed by being passed from one function to the next. This program is used to start a database for the project and then the following procedure is used. Data in the form of individual gel readings are entered into the computer and stored in separate files using either this program or the digitizer program. Batches of these gel readings are passed to the screening functions in this program to search for overlaps with vector sequences ("screen against vector") or for matches to restriction enzyme sites that should not be present ("screen against enzymes"). Each run of these screening functions passes on only those gel readings that do not contain unwanted sequences. Sequences are passed via files of file names and eventually are processed by the automatic assembly function ("auto assemble"). This function compares each gel reading with a consensus of all the previous gel readings stored in the database. If it finds any overlaps it aligns the overlapping sequences by inserting padding characters, and then adds the new gel reading to the database. Gels that overlap are added to existing contigs and gels that do not overlap any data in the database start new contigs. If a new gel overlaps two contigs they are joined. Any gel readings that appear to overlap but which cannot be aligned sufficiently well are not entered and have their names written to a file of failed gel reading names. Generally data is entered into the database in batches as just described. The program is also used to examine the data in the database, to enter gel readings that the automatic assembly function cannot align ("enter new gel reading"), and to make final edits. Edits to whole contigs can be made in several ways. 
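
The comparison "in both senses" mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the program's own algorithm: the names `reverse_complement`, `find_seed` and `min_match` are invented here, and seeding the search with an exact match of fixed length is an assumption. The sketch tries a new gel reading as read and then reversed and complemented, and reports the first exact match against the consensus of the data already in the database.

```python
# Complement table for the four bases; any other symbol (padding '*',
# uncertainty codes) is left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Turn a sequence around: complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def find_seed(reading, consensus, min_match=15):
    """Return (sense, reading_pos, consensus_pos) of the first exact match, or None.

    sense is '+' for the reading as read and '-' for its reverse complement;
    positions are 0-based offsets. This is an illustrative seed search only.
    """
    for sense, seq in (("+", reading), ("-", reverse_complement(reading))):
        for i in range(len(seq) - min_match + 1):
            j = consensus.find(seq[i:i + min_match])
            if j >= 0:
                return sense, i, j
    return None


# Hypothetical data: the reading is the complementary strand of part of the consensus.
consensus = "TTGCGTTGGGCCGAGCCCAACTTTCCCAAA"
reading = reverse_complement(consensus[5:25])
print(find_seed(reading, consensus, min_match=12))  # ('-', 0, 5)
```

Only the new reading is turned around; the stored consensus is searched as it stands, in keeping with the point that all searching is between a new gel reading and the data already in the database.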
An automatic editor ("auto edit") will perform almost all edits without any user intervention, but the program also gives access to the system editor (EDT on the VAX), through the function "screen edit", and to simple command driven editors ("edit contig" and "edit new gel reading"). Disagreements between gel readings in contigs and their consensus sequences can be highlighted by use of the function "highlight disagreements". Editing the sequences is obviously an essential part of managing a sequencing project. Editing is required when new sequences are added, when contigs are joined, and when sequences are corrected. A basic part of the strategy used here is that new gel readings should be correctly aligned throughout their whole length when they are entered into the database, and that when contigs are joined they are edited so that they are well aligned in the region of overlap. Alignment can be achieved by adding padding characters to the sequences, and this is the way "auto assemble" operates when adding new sequences to the database. In order to search for overlaps that may have been missed due to errors in the gel readings, the function "extract gel readings" can be used to take copies of the gel readings at the ends of contigs, and write them out as separate files. These can then be compared with the database consensus using the "auto assemble" function in a mode that forbids entry of data into the database, and any gel reading matching two contigs will indicate a join that has been missed. The joins can then be made interactively using "join contigs". Missed matches can be found at this stage because the errors in the sequences may have been corrected by new data. Generally the users need not concern themselves with how the relational information is used by the program, but it is necessary to know how contigs are identified. 
Because contigs are constantly being changed and reordered the program identifies them by the numbers of the gel readings they contain. Whenever users need to identify a contig they need only know the number or name of one of the gel readings it contains. Whenever the program asks users to identify a contig or gel reading they can type its number or its archive name. If they type its archive name they must precede the name by a slash "/" symbol to denote that it is a name rather than a number. E.g. if the archive name is fred.gel with number 99, users should type /fred.gel or 99 when asked to identify the contig. Generally, when it asks for the gel reading to be identified, the program will offer the user a default name, and if the user types only return, that contig will be accessed. When a database is opened the default contig will be the longest one, but if another is accessed, it will subsequently become the current default. Further information is located in the following places. The database files are described under "open database". The format for vector and consensus sequences is given under "calculate a consensus", as are the uncertainty codes used in gel readings. The only program, other than this, relevant to sequencing is the digitizer program and it is outlined briefly below. The digitizer program is used for the initial input of gel readings and for writing a file of file names. The program uses a digitizer for data entry. A digitizer is a two dimensional surface, such as a light box, on which the coordinates of a special pen are recorded by a computer whenever the pen is pressed onto it. These coordinates can be interpreted by a program. In order to read an autoradiograph placed on the light box the user need only define the bottom of the four sequencing lanes and the bases to which they correspond and then use the pen to point to each successive band progressing up the gel. 
The program examines the coordinates of each pen position to see in which of the four lanes it lies and assigns the corresponding base to be stored in the computer. Each time the pen tip is depressed to point to a position on the surface of the digitizer the program sounds the bell on the terminal to indicate to the user that a point has been recorded. As the sequence is read the program displays it on the screen. @17. TX 1 @Screen against restriction enzymes Used to compare gel readings against any restriction enzyme recognition sequences that may have been used during cloning and which should not be present in the data. Works on single gel readings or processes batches accessed through files of file names. The algorithm looks for exact matches to recognition sequences stored in a file. The file containing the recognition sequences must be identified. The user must choose between employing a file of file names, or typing in the names of individual gel reading files. If a file of file names is used the program will also create a new file of file names. When the option has finished operating this new file will contain the names of all those gel readings that did not match any of the recognition sequences. Hence it can be used for further processing of the batch. The recognition sequences should be stored in a simple text file with one recognition sequence per line. @18. TX 1 @Screen against vector Used to compare gel readings against any vector sequences that may have been picked up during cloning. Works on single gel readings or processes batches accessed through files of file names. The algorithm looks for exact matches of length "minimum match length" and displays the overlapping sequences. The file containing the vector sequence must be identified. The user must choose between employing a file of file names, or typing in the names of individual gel reading files. If a file of file names is used the program will also create a new file of file names. 
When the option has finished operating this new file will contain the names of all those gel readings that did not match the vector sequence. Hence it can be used for further processing of the batch. The vector sequence should be stored in a simple text file with up to 80 characters of data per line. More than one vector can be stored in a single file. If so each should be preceded by a 20 character title of the form <---m13mp8.001-----> where the < and > signs and the number like .001 are obligatory. The number must be preceded by a dot (.) and be 3 digits long. The total sequence in the file must be less than 50,001 characters in length. @20. TX 2 @Auto assemble Compares gel readings against the current contents of the database and produces alignments. In its normal mode of operation ("entry permitted"), the function will automatically enter the gel readings into the database, but if entry is not permitted it will only produce alignments. It works on single gel readings or processes batches of gel readings accessed through files of file names. It is the usual way to enter data into the database. The function will check the database for logical consistency and will only proceed if it is OK. Choose if gel readings should be entered into the database, or if they should only be compared. Choose between using a file of file names or typing file names on the keyboard. If so selected, supply the file of file names. Also supply a file of file names to contain the names of all the gel readings that fail to get entered. Select the entry mode. Normal assembly is appropriate for all but special cases, as is "permit joins". Uses for the other modes are not documented here. Define a minimum initial match length. Define a minimum alignment block (the default value is taken in all but exceptional circumstances). 
Define the maximum number of padding characters allowed to be used in each gel reading to help achieve alignment, and the same for the number allowed in the contig for each gel reading. Finally define the maximum percentage mismatch to be allowed for any gel reading to be entered into the database. If, for any gel reading, any of these last three values is exceeded the gel reading will not be entered into the database. In operation the function takes a batch of gel readings (probably passed on as a file of file names from one of the screening routines) and enters them into a database for a sequencing project. It takes each gel reading in turn and compares it with the current consensus for the database. It then produces an alignment for any regions of the consensus it overlaps; if this alignment is sufficiently good it edits both the new gel reading and the sequences it overlaps and adds the new gel reading to the database. The program then updates the consensus accordingly and carries on to the next gel reading. All alignments are displayed and any gel readings that do match but that cannot be aligned sufficiently well have their names written to a file of failed gel reading names. The function works without any user intervention and can process any number of gel readings in a single run. Those gel readings that fail can be recompared using the same function (to find the current overlap position) and the user can enter them into the database manually using the "enter new gel reading" option. Typical dialogue and output from the function is shown below. (Note that output for gel readings 2 - 9 has been deleted to save space). Automatic sequence assembler Database is logically consistent ? (y/n) (y) Permit entry ? (y/n) (y) Use file of file names ? File of gel reading names=demo.nam ? File for names of failures=demo.fail Select entry mode X 1 Perform normal shotgun assembly 2 Put all sequences in one contig 3 Put all sequences in new contigs ? 
Selection (1-3) (1) = ? (y/n) (y) Permit joins ? Minimum initial match (12-4097) (15) = ? Minimum alignment block (2-5) (3) = ? Maximum pads per gel (0-25) (8) = ? Maximum pads per gel in contig (0-25) (8) = ? Maximum percent mismatch after alignment (0.00-15.00) (8.00) = >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Processing 1 in batch Gel reading name=HINW.004 Gel reading length= 283 Searching for overlaps Strand 1 Strand 2 No matches found Total matches found 1 Padding in contig= 0 and in gel= 1 Percentage mismatch after alignment = 1.8 Best alignment found 1 11 21 31 41 51 TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG ********** ********** ********** ********** ********** ********** TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG 1 11 21 31 41 51 61 71 81 91 101 111 TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG ********** ********** ********** ********** ********** ***** **** TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG 61 71 81 91 101 111 121 131 141 151 161 171 CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT ********** ********** ********** ********** ********** ********** CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT 121 131 141 151 161 171 181 191 201 211 221 231 GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC ********** ********** * ******** ********** ********** ********** GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC 181 191 201 211 221 231 241 251 261 271 281 CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC **** **** ********** ** * ***** ********** **** CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC 241 251 261 271 281 Batch finished 9 sequences processed 0 sequences entered into database 0 joins made Note that "auto assemble" cannot align protein sequences. @28. 
TX 1 @Highlight disagreements Used in the latter stages of a project to highlight disagreements between individual gel readings and their consensus sequences. Characters that agree with the consensus are shown as : symbols for the plus strand and . for the minus strand. Characters that disagree with the consensus are left unchanged and so stand out clearly. The results of this analysis are written to a file. Before selecting this option create a file of the display of the contig to be "highlighted". The option will ask for the name of this file. Select symbols to denote "agreeing" characters on each strand, the defaults are : and ., but any others can be used. Supply the name of a file in which to put the output. The display file needed as input for this option is created by selecting "Redirect output", followed immediately by "display contig", and then "Redirect output" again. The cutoff score used in the consensus calculation can be set by option "set display parameters". Note that for the highlight function there is a limit of 50 for the number of gel readings that are aligned at any position - ie the contig must be less than 51 gel readings deep at its thickest point. I hope that those performing shotgun sequencing never reach this limit, but those using the program for comparing sequence families might. Typical output from this function is shown below. 210 220 230 240 250 1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC:::: 7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA:::: -4 HINW.017 ...............AC.... G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG 260 270 280 290 300 1 HINW.004 ::::::::::::*:D::::::::::::::::::: 7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA: -4 HINW.017 ..............................................A... 3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV::: -6 HINW.028 ......................A... AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG @32. 
TX 3 @Extract gel readings Used to make copies of the aligned gel readings in a database, to write them into separate files, and to write a corresponding file of file names. It operates in two modes: either all gel readings are extracted, or only those at the ends of contigs. Choose which mode of operation is required and supply a file of file names. The gel readings are given their original names. If used to extract the gel readings from the ends of contigs the function is useful for checking for missed contig joins: the file of file names can be used with the auto assemble function to recompare these gel readings, and each should only overlap one contig. Any that overlap two contigs will identify possible joins. If the option is used to extract all the gel readings from a database, a subsequent run of "auto assemble" can reconstitute a database which has been corrupted. This rarely occurs and is usually necessitated by a user employing "alter relationships" incorrectly without first having made a copy. @1. TX 0 @Help Help is available on the following topics : @2. TX 0 Quit This command stops the program and is the only safe way to terminate a run of the program that has altered the contents of the database in any way. @3. TX 1 @Open a database Opens existing databases or allows new ones to be started. The function is automatically called into operation when the program is started but can also be selected from the general menu. Choose to open an existing database or start a new one, or if ! is typed when the program is first started, enter the program without opening a database. Supply a project database name, and if it already exists, the "version". If starting a new database define the database size and if it is for DNA or protein sequences. The database size is an initial size for the database. It can be increased later during the project. It is the sum of the number of gel readings plus the number of contigs. 
Database names can have from one to 12 letters and must not include full stop (.). The database is made from three separate files. If the database is called FRED then version 0 of database FRED comprises files FRED.AR0, FRED.RL0 and FRED.SQ0. The version is the last symbol in the file names. Only this program can read these files. If the "copy database" option is used it will ask the user to define a new "version". For normal use the maximum gel reading length is set to 512 characters, but when a database is started the user may choose lengths of either 512, 1024, 1536..., 4096. Normally the program is used to handle DNA sequences but many of the functions also work on protein sequences. The choice of sequence type is made when the database is started. The contigs are not stored on the disk as the user sees them displayed on the screen. Each gel reading is stored with sufficient information about how it overlaps other gel readings so that the program can work out how to present them aligned on the screen. We refer to this extra data as "the relationships" and it is explained below. The database comprises 3 separate files. 1. a working version of each gel reading. This is the version of the gel reading that is in the database and initially it is an exact copy of the original sequence (known as the archive) but it is edited and manipulated to align it with other gel readings. 2. the file of relationships. This file contains all of the information that is required to assemble the working versions into contigs during processing; any manipulations on the data use this file and it is automatically updated at any time that the relationships are changed. 
The information in this file is as follows: (A) Facts about each gel reading and its relationship to others ("gel descriptor lines"): (a) the number of the gel reading (each gel reading is given a number as it is entered into the database) (b) the length of the sequence from this gel reading (c) the position of the left end of this gel reading relative to the left end of the contig of which it is a member (d) the number of the next gel reading to the left of this gel reading (e) the number of the next gel reading to the right (f) the relative strandedness of this gel reading , ie whether it is in the same sense or the complementary sense as its archive. (B) Facts about each contig ("contig descriptor lines"): (a) the length of this contig (b) the number of the leftmost gel reading of this contig (c) the number of the rightmost gel reading of this contig. (C) General facts: (a) the number of gel readings in the database (b) the number of contigs in the database. 3. the file of archive names. This is simply a list of the names of each of the archive files in the database but on line number 1000 we also store the size of the database. ie the number of lines of information allowed in the database files. This file always has 1000 lines but the length of the file of relationships and the file of working versions can be set by the user when creating a database or when copying from one to another. Structure of the database files 1. The file of relationships The file contains IDBSIZ lines of data: the general data are stored on line IDBSIZ; data about gel readings are stored from line 1 downwards; data about contigs are stored from line IDBSIZ-1 upwards. A database of 500 lines containing 25 gel readings and 4 contigs would have a file of relationships as is shown below. 
--------------------------------------------- 1 Gel descriptor record 2 " " " 3 " " " 4 " " " 5 " " " ' ' ' ' ' ' ' ' 25 " " " 26 Empty record ' ' ' ' ' ' 495 ' ' 496 Contig descriptor record 497 " " " 498 " " " 499 " " " 500 Number of gel readings=25, Number of contigs=4 --------------------------------------------- The arrangement of the data in the file of relationships As each new gel reading is added into the database a new line is added to the end of the list of gel descriptor lines. If this new gel reading does not overlap with any gel readings already in the database a new contig line is added to the top of the list of contig lines. If it overlaps with one contig then no new contig line need be added but if it overlaps with two contigs then these two contigs must be joined and the number of contig lines will be reduced by one. Then the list of contig lines is compressed to leave the empty line at the top of the list. Initially the two types of line will move towards one another but eventually, as contigs are joined, the contig descriptor lines will move in the same direction as the gel descriptor lines. At the end of a project there should be only one contig line. The database is thus capable of handling a project of 998 gels. Structure of the working versions file The working versions of gel readings are stored in a file of IDBSIZ lines each containing 512 characters. Gel reading number 1 is stored on line 1, gel reading number 2 on line 2 and so on. Structure of the archive names file This file, unlike the others, always has 1000 lines each 10 characters in length. Its length is fixed because line 1000 is used to store IDBSIZ the database size and the programs need a definite location from which to read this number. Safeguarding the database It is advisable to copy regularly (using the copy function of DS) from say copy 0 to copy 1 in case of errors. 
I also recommend setting the protection codes on copy 0 of each database so that users cannot delete the files without first resetting the protection codes. This will protect you from accidentally deleting the files. Users at LMB can use the PROTECT command for this purpose. The give-up options allow users to change their minds about entering a new gel reading or joining two contigs without affecting the file of relationships. BUT if the edit contig option from either of these two functions has been used the edits will remain even though the user has "given up". To leave the files completely unaffected the user could, if required, undo any edits before "giving up". There are various checks within the programs to protect users from themselves:- 1. All user input is checked for errors - e.g. reference to non-existent gel readings or contigs, incorrect positions in the contig or gel readings. 2. Before entering a gel reading the system checks to see if a file of the same name has already been entered. 3. Join will not allow the circularising of a contig. 4. Both enter and join functions restrict the region that the user is allowed to edit (using edit contig) to the region of overlap. 5. Users may escape from any point in the program. 6. Help is available from all points in the program. IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE DATABASE, I.E. DURING AUTO ASSEMBLE, COMPLETE ENTRY, COMPLETE JOIN, COMPLEMENT CONTIG, EDIT CONTIG, AND SCREEN EDIT. This could corrupt the database so badly that it is impossible to fix. The program should always be left using the QUIT option. @4. TX 3 @Edit A simple command driven editor that can insert, delete and change gel reading sequences. Insert, delete and change commands will request the position at which the edit is required and the number of characters to insert, delete or change. The default character for insertions is *.
There are three modes of editing offered by this editor depending on where it is selected from. New gel readings can be edited as they are being entered into the database, contigs can be edited with alignments being automatically maintained, or gel readings in contigs can be edited without the maintenance of alignments. The following commands can be used. ? = Help ! = Quit 3 = Insert 4 = Delete 5 = Change All commands request the position at which the edit should be made. (Note that the position refers to the position in the contig for gel readings in the database, but to the position in the gel reading if you are editing a new gel reading while entering it into the database.) All commands request the number of characters to operate on. (Note that if you are editing a contig the program will ask for the characters to insert into each separate gel reading, hence allowing different changes to be made to each. Also the default character is asterisk (*) - i.e. if you include a space in the string it will be replaced by an asterisk, or if you simply type return the whole string inserted will be asterisks.) "Change" allows characters in individual gel readings to be replaced. If the user is not editing a new gel reading during "enter new gel reading" the program will request the number of the gel reading to edit. (When editing gel readings in contigs the program responds with the relative position and length of the selected gel reading in case the user only knows the edit position relative to the gel reading. The edit position itself must be given relative to the contig.) Further notes on editing When you are editing a contig the program maintains the alignments of the gel readings by always making the same number of insertions or deletions in all the gels. Note that these edits are immediately carried out and the "Quit" options of "enter new gel reading" and "join contigs" do not undo them. Users must undo them themselves.
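The alignment-preserving insertion described above can be illustrated with a minimal Python sketch. It is not the program's own code: the names GelReading, insert_padding and column are hypothetical, only padding characters are inserted (the real editor lets the user type different characters for each gel reading), and the treatment of gel readings that start at or to the right of the edit position (they are shifted rather than padded) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GelReading:
    name: str       # archive name (illustrative only)
    position: int   # 1-based position of the left end within the contig
    sequence: str   # working version of the gel reading


def insert_padding(gels, contig_pos, count, pad="*"):
    """Insert `count` pad characters in front of contig position `contig_pos`.

    Every gel reading that covers the position receives the same number of
    characters, so columns to the right of the edit stay aligned.
    Assumption: readings starting at or after the position are shifted right.
    """
    for gel in gels:
        right_end = gel.position + len(gel.sequence) - 1
        if gel.position < contig_pos <= right_end:
            offset = contig_pos - gel.position
            gel.sequence = (gel.sequence[:offset] + pad * count
                            + gel.sequence[offset:])
        elif gel.position >= contig_pos:
            gel.position += count
    return gels


def column(gels, contig_pos):
    """Characters shown at one contig position by every covering reading."""
    return [g.sequence[contig_pos - g.position]
            for g in gels
            if g.position <= contig_pos < g.position + len(g.sequence)]
```

With three made-up gel readings at positions 1, 5 and 7, inserting two characters at contig position 6 pads the first two and shifts the third, and every column to the right of the edit still lines up.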
Note that if this option has been entered from either "enter new gel reading" or "join contigs" the program will restrict edits to the region of overlap. DO NOT KILL THE PROGRAM DURING EDIT CONTIG! When editing a single gel reading in a contig from "alter relationships" (which you should not normally need to do) the program will correct the length of the individual gel reading, but it will not update the length of the contig if it has changed. The program contains better methods than this simple command driven editor, for making multiple edits to contigs. "Screen edit", gives access to the system editor on your machine, and "auto edit" will edit a whole contig automatically. @9. TX 3 @Screen edit Gives access to the system editor on the machine (for example EDT on a VAX) and allows users to edit contigs. The contigs are presented as for "display contig" and the program will reconstitute the contig's sequences and relationships when the editor is exited. To screen edit a contig set the line length to 50 characters, select the contig to edit, and supply the name of a temporary file in which the editing will be performed. After a short pause the system editor will present the first page of the file. Edit the file obeying the rules given below. Exit from the editor and affirm the intention of returning the contig to the database. The program will put the contig back into the database. Rules for screen editing There are some limitations on the changes that can be made to the contigs when using the screen editor. Users are unlikely to want to break the rules in order to achieve changes to contigs, but nevertheless the constraints need to be defined and they are given below. Alignments must be maintained during editing. Whole lines of sequence should not be deleted or added unless the order of the gel readings in the contig is preserved. Each line in the contig display consists of gel reading numbers, their names and 50 character sections of sequence. 
Insertions are limited in the following way. No line of sequence can be extended rightwards more than 10 characters beyond the end of a full length line (a full length line is 50 characters long). Only one character can be added to the left end of full length lines, but sections of sequence beginning further into a line can be extended leftwards up to an equivalent position. Do not delete any non-sequence lines in the file. Before returning the contig to the database the program checks that the rules have been obeyed. If an error is found the number of the erroneous line in the file is displayed and the contig will not be changed. @5. TX 1 @Display a contig Used to show the aligned gel readings for any part of a contig. The number, name and strandedness of each gel reading are shown and the consensus is written below. If required identify the contig, and then the start and end points of the region to display. The display can be directed to a disk file using "direct output to disk". These files are required by the options "screen edit" and "highlight disagreements", and printed copies of them are very useful for marking corrections prior to using the editors. Below is an example showing the left end of a contig from position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and 12; gels 6, 3 and 5 are in reverse orientation to their archives (denoted by a minus sign). There are a few uncertainty codes and a few padding characters in the working versions, but the consensus (shown below each page width) has a definite assignment for almost every position.
                      10        20        30        40        50
  -6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
    CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                      60        70        80        90       100
  -6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                         GGCACA*GTC
    CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                     110       120       130       140       150
  -6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                     GGCAGCGACCGTCG
  17 HINW.999                                        AGCGACCGTCG
    CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                     160       170       180       190       200
  -6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                           GTAGAGC*
    CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

@6. TX 1 @List a text file
This option allows users to list text files on the screen. It can be used to read a file containing notes, for checking files written to disk etc. The user is asked to type the name of the file to list.

@8. TX 1 @Calculate a consensus
Calculates a consensus sequence either for the whole database or for selected contigs. The consensus is written to a file named by the user. Supply a file name, then choose between whole database or selected contigs.

Symbols for uncertainty in gel readings
In order to record uncertainties when reading gels the codes shown below can be used. Use of these codes permits us to extract the maximum amount of data from each gel and yet record any doubts by choice of code. The program can deal with all of these codes and any other characters in a sequence are treated as dash (-) characters.
SYMBOL   MEANING
1        PROBABLY C
2           "     T
3           "     A
4           "     G
D           "     C   POSSIBLY CC
V           "     T      "     TT
B           "     A      "     AA
H           "     G      "     GG
K           "     C      "     C-
L           "     T      "     T-
M           "     A      "     A-
N           "     G      "     G-
R        A OR G
Y        C OR T
5        A OR C
6        G OR T
7        A OR T
8        G OR C
-        A OR G OR C OR T
a        A set by auto edit
c        C set by auto edit
g        G set by auto edit
t        T set by auto edit
*        padding character placed by auto assembler
else = -

The DNA consensus algorithm
The "calculate consensus" function, the "display contig" routine and the "show quality" option use the rules outlined here to calculate a consensus from aligned gel readings. Note that "display contig" calculates a consensus for each page width it displays (it does not use the consensus sequence file calculated by the consensus function). We have 6 possible symbols in the consensus sequence: A,C,G,T,* and -. The last symbol is assigned if none of the others makes up a sufficient proportion of the aligned characters at any position in the contig. The following calculation is used to decide which symbol to place in the consensus at each position. Each uncertainty code contributes a score to one of A,C,G,T,* and also to the total at each point. Symbols like R and Y which don't correspond to a single base type contribute only to the total at each point. The scores are shown below.

definite assignments i.e. A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,*    = 1
probable assignments i.e. 1,2,3,4                              = 0.75
other uncertainty codes including R,Y,5,6,7,8,-                = 0.1

A cutoff score of 51% to 100% is supplied by the user. (When the program starts this is set to 75%. See "set display parameters".) At each position in the contig we calculate the total score for each of the 5 symbols A,C,G,T and * (denote these by Xi, where i=A,C,G,T or *), and also the sum of these totals (denote this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed in the consensus; otherwise - is assigned.
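The scoring rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's own code: the function names are hypothetical, each code is assigned to a base by reading the symbol table (for example D counts as C), S is taken to include the 0.1 contributed by codes such as R and Y, and a space in an aligned row is taken to mean "no coverage".

```python
# Which base each code scores for, read off the symbol table.
DEFINITE = {"A": "A", "C": "C", "G": "G", "T": "T",
            "B": "A", "D": "C", "H": "G", "V": "T",
            "K": "C", "L": "T", "M": "A", "N": "G",
            "a": "A", "c": "C", "g": "G", "t": "T", "*": "*"}
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    scores = {symbol: 0.0 for symbol in "ACGT*"}   # the Xi of the text
    total = 0.0                                    # the S of the text
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            # R, Y, 5, 6, 7, 8, - and any other character: total only.
            total += 0.1
    if total == 0.0:
        return "-"
    for symbol, x in scores.items():
        # Strict '>' as written in the text; whether a score exactly at
        # the cutoff is accepted is not stated there.
        if 100.0 * x / total > cutoff:
            return symbol
    return "-"


def consensus(rows, cutoff=75.0):
    """Consensus of equally positioned rows; a space means no coverage."""
    width = max(len(row) for row in rows)
    result = []
    for i in range(width):
        column = [row[i] for row in rows if i < len(row) and row[i] != " "]
        result.append(consensus_symbol(column, cutoff))
    return "".join(result)
```

Because the cutoff is always above 50%, at most one symbol can pass the test at any position, so the order in which the five symbols are checked does not matter.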
Notice that S does not equal the number of times the sequence has been determined, but is the score total, and hence we are less likely to put a - in the consensus. For the "examine quality" algorithm each strand is treated separately but the calculation is the same. (It was originally different.) Format of the consensus sequence (and vector sequences). A consensus sequence file may contain the consensus for several contigs and so we identify each of them by preceding it with a 20 character title. The title is of the form <---LAMBDA.076-----> (where LAMBDA is the project name and gel reading number 76 is the leftmost gel reading to contribute to this consensus sequence). The angle brackets <> and the three digit number preceded by a . are important to some processing programs. @25. TX 1 @Show relationships Used to show the relationships of the gel readings in the database in three ways - (a) All contig descriptor lines followed by all gel descriptor lines. (b) All contigs one after the other sorted, i.e. for each contig show its contig descriptor line followed by all its gel descriptor lines sorted on position from left to right. (c) Selected contigs: show the contig line and, in order, those gel readings that cover a user-defined region. Note that this output can be directed to a disk file by prior selection of "disk output". Below is an example showing a contig from position 1 to 689. The leftmost gel reading is number 6 and has archive name HINW.010, the rightmost gel reading is number 2 and has archive name HINW.004. On each gel descriptor line is shown: the name of the archive version, the gel number, the position of the left end of the gel reading relative to the left end of the contig, the length of the gel reading (if this is negative it means that the gel reading is in the opposite orientation to its archive), the number of the gel reading to the left and the number of the gel reading to the right.
CONTIG LINES
 CONTIG LINE   LENGTH         ENDS
                          LEFT   RIGHT
      48         689         6       2
GEL LINES
 NAME      NUMBER  POSITION  LENGTH     NEIGHBOURS
                                       LEFT   RIGHT
 HINW.010       6         1    -279       0       3
 HINW.007       3        91    -265       6       5
 HINW.009       5       137    -299       3      17
 HINW.999      17       140     273       5      12
 HINW.017      12       193     265      17      18
 HINW.031      18       385    -245      12       2
 HINW.004       2       401    -289      18       0

@21. TX 3 @Enter new gel reading
Used to enter new gel readings into the database. The new gel reading must have previously been compared with the contents of the database by use of "auto assemble" in order to ascertain if it overlaps any previously entered data. The user is expected to know: if the gel reading overlaps; if so which contig it overlaps; if so where it overlaps. The program takes the user through a series of questions to establish the nature of the overlap and then displays the overlap. The user is then offered a number of options, including editors for the new gel reading and the contig, to enable the correct alignment of the gel reading throughout its whole length.

Supply the name of the gel reading file. If the gel reading has been entered before the program will not permit entry. The program gives the gel reading a unique number and asks if the sequence overlaps any data already in the database (reported by "auto assemble"). If it does not, entry is complete. If it does overlap the dialogue continues with the program asking if the gel reading overlaps "in the normal sense"; if not it will automatically complement the sequence. Then supply the number of the contig the gel reading overlaps (as reported by "auto assemble"). Overlaps are divided into two types: those for which the new gel reading protrudes from the left end of the contig it overlaps, and those for which it does not. The program asks about this with the question "Left end of gel reading is inside contig". If this is true the program will go on to ask for the position in the contig of the left end of the new gel reading.
If it is not true the program will ask for the position in the new gel reading of the left end of the contig. Once this is completed the program will display the first 50 bases of the overlap. The gel readings in the contig and their consensus are displayed with the new gel reading underneath. The mismatches are shown by *'s on the next line down. For example: 60 70 80 90 100 -6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC -3 HINW.007 GGCACA*GTC CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC NEWGEL CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC MISMATCH * * * 10 20 30 40 50 The program then needs to know if the position of the left end of the overlap is correct. If it is the user should type return, if not, 1 and the program will ask for the new position and display it. The program now offers a number of options to allow the user to align the new gel reading correctly over its whole length with the data already in the contig. It is important that sufficient edits are made to the new gel reading or the sequences in the contig at this stage to get the alignment correct, because once entry is completed, the alignment is fixed and cannot easily be changed (see "alter relationships"). Alignment can be achieved by making insertions or deletions but deletion of data requires the original gels to be checked. For this reason at entry we usually make only insertions to achieve alignment. We use X or asterisks (*) as padding characters to achieve alignment and so can, if required, distinguish padding characters from characters assigned from reading gels. The options available are: ? = HELP ! = Give up 3 = Complete entry 4 = Edit contig 5 = Display overlap 6 = Edit new gel reading 1. HELP gives this information. 2. Give up allows users to change their minds about entering the new gel reading. The program will ask the user to confirm this choice. 3. Complete entry is the command to add the new gel reading to the contig. 
The program updates the relationships accordingly. The user is asked to confirm this command. 4. Edit contig gives the user access to a simple editor that allows insertions, deletions and changes to be made to the contig. The editor maintains alignments by making the same number of insertions or deletions in all sequences covering the edit position. The program protects the user by allowing edits only within the region of overlap. 5. Display allows display of the region of overlap only. This is defined by the relative positions in the contig. The default is the whole of the region of overlap. 6. Edit new gel reading allows the new gel reading to be edited using a simple editor. @23. TX 3 @Complement a contig This function will complement and reverse all of the gel readings in a contig. It automatically reverses and complements each gel reading sequence, reorders left and right neighbours, recalculates relative positions and changes each strandedness. The only user input required is to identify the contig to complement by the number or name of a gel reading it contains. DO NOT KILL THE PROGRAM DURING THIS STEP! @22. TX 3 @Join contigs This function joins contigs interactively. It allows the user to align the ends of the two contigs by editing each contig separately. It is important that the alignment achieved is correct because once the join is completed the alignment is fixed. The program needs to know which two contigs to join and where they overlap. First which two contigs are to be joined. The user should identify the two contigs. First the left contig and then the right. The program checks that the two contig numbers are different (it will not allow circles to be formed!) Now identify the exact position of overlap. This is defined as the position in the left contig that the leftmost character of the right contig overlaps. Normally the position is established by employing the end gel reading for option "auto assemble". 
The overlap must be of at least one character. The program then displays the join showing all the gel readings overlapping the join from the left contig, their consensus, all the gel readings from the right contig that overlap the join, their consensus and then asterisks to denote mismatches between the two consensuses. For example:

                    1460      1470      1480      1490      1500
  56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
  33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
 -25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
  19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
    CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
  -6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
  -3 HINW.007             TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
  -5 HINW.009                           GCTCGGAGGGCATGAAGT*AGAGC
    CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
     MISMATCH                      *                      ******
                      10        20        30        40        50

It is essential that the user aligns the two contigs throughout the whole region of overlap before completing the join because it is only at this stage that the two contigs can be edited independently. Once the join is completed the alignment can only be altered using the routines supplied by "alter relationships". The program offers the user options to facilitate the alignment of the two contigs. These options are:-

? = Help
! = Give up
3 = Complete join
4 = Edit left contig
5 = Display joint
6 = Edit right contig
7 = Move join

1. Help gives this information.
2. Give up allows the user to return to the main options without completing the join. Note any edits made will remain.
3. Complete join instructs the program to update the relationships so that the two contigs are joined. DO NOT KILL THE PROGRAM DURING COMPLETE JOIN!
4. Edit left contig and edit right contig give access to a simple editor that allows insertions, deletions and changes to be made to the contigs. Help is available on editing once the editing option is selected.
The user is only allowed to edit within the region of overlap and should make sure that the positions used correspond to the correct contig. 5. Display join displays the joint as shown above. 6. See above. 7. Move join allows the position of the joint to be changed. @24. TX 1 @ Copy the database Used to make a copy of the database. If required the database size can be altered using this option. The "version" of a database is encoded as the last letter in the names of the three files that contain the database. Supply a "version" number (the default is version 1), and if required select a new size for the database. The size of a database is the number of lines of information it can hold. It needs a line for each gel reading and another for each contig. @19. TX 1 @ Check database Used to perform a check on the logical consistency of the database. No user intervention is required. The following relationships are checked: 1. If gel reading A thinks gel reading B is its left neighbour does B think A is its right neighbour? The error message is "Hand holding problem for gel reading A" followed by the gel descriptor lines for gel readings A and B. 2. Are there any contig lines with no left or right end gel readings? The error message is "Bad contig line number A" 3. Do the gel readings that are described as left ends on contig lines agree that they are left ends? The error message is "The end gel readings of contig A have outward neighbours" 4. Are there gel readings that are in more than one contig? The error message is " Gel number A is used N times" 5. Are there gel readings that are not in any contig? The error message is " Gel number A is not used" 6. Do the relative positions of gel readings agree with their position as defined by left and right neighbourliness? The error message is " Gel number A with position X is left neighbour of gel number B with position Y" 7. Are there any loops in contigs? If so no further checking is done. 
The error message is " Loop in contig n no further checking done, but gel reading numbers follow" The program then prints the gel reading numbers in the looped contig up to the start of the loop. 8. Are there any contigs of length <1? The error message is " The contig on line number x has zero length" 9. Are there any gel readings (used in only one contig) that have zero length? The error message is " Gel number N has zero length" Note that "auto assemble" also uses this logical consistency check and will only tolerate a "Gel number N is not used" error. Any other error will cause it to give up. @29. TX 1 @ Examine quality Analyses the quality of the data in a contig. It reports on the proportion of the consensus that is "well determined" and will display a sequence of symbols that indicate the quality of the consensus at each position. Identify the contig to analyse, and the section of interest. The current consensus calculation cutoff score will be used to decide if each position is "well determined". In general the quality of a reading deteriorates along the length of the gel and so it is also possible to use a length cutoff for the quality calculation. Only the data from the first section of each reading will be included in the quality calculation. The length is altered under "set parameters" and is initially set to the maximum reading length. A summary showing the percentage of the consensus that falls into each category of quality is shown. Choose whether or not to have the quality codes for each position of the consensus displayed. They can be displayed as either graphics or text. The quality of the data depends on the number of times it has been sequenced and the particular uncertainty codes used in each gel reading. This function divides the data into five categories, assigning each a symbol or code: 1. Well determined on both strands and they agree. code=0 2. Well determined on the plus strand only. code=1 3. Well determined on the minus strand only.
code=2 4. Not well determined on either strand. code=3 5. Well determined on both strands but they disagree. code=4

A position is "well determined" if it is assigned one of the symbols A,C,G,T when the algorithm described in the section "calculate a consensus" is applied. The calculation is performed separately for each strand.

If the user chooses to have the data displayed graphically the following scheme is used. A rectangular box is drawn so that the x coordinate represents the length of the contig. The box is notionally divided vertically into 5 possible levels which are given the y values: -2,-1,0,1,2. The quality codes attributed to each base position are plotted as rectangles. Each rectangle represents a region in which the quality codes are identical, so a single base having a different code from its immediate neighbours will appear as a very narrow rectangle.

            Rectangle bottom and top y values
Quality 0   rectangle from  0 to  0
Quality 1   rectangle from  0 to  1
Quality 2   rectangle from  0 to -1
Quality 3   rectangle from -1 to  1
Quality 4   rectangle from -2 to  2

Obviously a single line at the midheight shows a perfect sequence. Typical dialogue is shown below.

 41.47% OK on both strands and they agree(0)
 55.48% OK on plus strand only(1)
  2.08% OK on minus strand only(2)
  0.97% Bad on both strands(3)
  0.00% OK on both strands but they disagree(4)
 ? (y/n) (y) Show sequence of codes

        10         20         30         40         50
1111111111 1111111111 1111111111 1111111111 1111111111
        60         70         80         90        100
1111111111 1111111111 1111111111 3111111111 1111111111
       110        120        130        140        150
1111111111 1111131111 1111111111 1111111111 1111111111
       160        170        180        190        200
1111111111 1111111111 1111111111 1111111111 1111111133
       210        220        230        240        250
1311111111 1111111111 1111111110 0000000000 0000220000
       260        270        280        290        300
0000000000 0020000000 2200000202 0002000000 0000222200

@26. TX 3 @ Alter relationships
Used to make what are normally illegal changes to the database.
That is the normal checks are not done and any item in the database can be changed independently of all others. Users need to know what they are doing because it is very easy to make a horrible mess. Always start by making a copy! By using the options here users can edit individual gel readings in contigs, move one section of a contig relative to another, break contigs, remove contigs, remove gel readings, etc. To give flexibility most of the commands do only one thing. This means that several commands may have to be executed to complete any change. At the end of this help section there are notes on removing gel readings from the database. The following options are offered: ? = HELP ! = QUIT 3 = Line change 4 = Edit single gel reading 5 = Delete contig 6 = Shift 7 = Move gel reading 8 = Rename gel reading 9 = Break a contig 1. HELP gives this information. 2. QUIT returns to the main options of SAP. 3. Line change allows the user to change the contents of any line in the file of relationships. The line is selected by number, the program prints the current line and prompts for the new line. 4. Edit allows the user to edit an individual gel reading independently of any others it may be related to. The edit positions are relative to the contig. The effect of this editing on the length of the gel reading is taken care of but, if it changes the length of a contig, or its relationship to others, this must be accounted for (if necessary) by use of the "line change" function. 5. Delete contig is a function that deletes a contig line by moving down all the contig lines above by one position. It prompts only for the line to delete. It does not delete any of the gel readings or gel reading lines for the deleted contig but it does reduce the number of contigs on line IDBSIZ by 1. 6. Shift allows the user to change all the relative positions of a set of neighbouring gel readings by some fixed value, i.e. it will shift related gel readings either left or right. 
It can therefore be used to change the alignment of the gel readings in a contig or as part of the process of breaking a contig into two parts (see below). It prompts for the number of the first gel reading to shift and then for the distance to move them (Note a negative value will move the gel readings left and a positive value right). It then chains rightwards (i.e. follows right neighbours) and shifts each gel reading, in turn, up to the end of the contig. (This means that only those gel readings from the first to shift to the rightmost are moved). It updates the length of the contig accordingly. 7. Move gel reading is a function to renumber a gel reading. It moves all the information about a gel reading on to another line. The user must specify the number of the gel reading to move and the number of the line on which to place it. It takes care of all the relationships. Of course gel readings must not be moved to lines occupied by other gel readings! It can be used as part of the process of removing a gel reading from the database (see below). 8. Rename gel reading is a function that is used to rename the archive names of gel readings in the database; it only changes the name in the .ARN file of the database. 9. Break contig. Occasionally it is necessary to break a contig into two parts and this can be achieved using this option. The program needs only the number of a gel reading. This is the gel reading that will become a left end after the break. That is, the break is made between this gel reading and its left neighbour. A new contig line is created so ensure that there is sufficient space in the database. Removing gel readings from contigs Gel readings can be removed from contigs if they are not essential for holding the contig together (i.e. are not the only gel reading covering a particular region). Suppose the gel reading to remove is gel number b with left neighbour a and right neighbour c.
Using "line change" change the right neighbour of a to c, and the left neighbour of c to a. To tidy things up: suppose there are x gel readings in the database; then, using "move gel reading" move gel x to line b; then, using "line change" decrease the number of gel readings in the database (stored in the last line) by 1. @27. TX 1 @ Set display parameters Used to redefine the parameters that control the cutoff employed by the consensus calculation and quality examiner, the maximum length of each reading to include in the quality calculation, the line length used by the display function, the text window length used by the graphics options, and the graphics window length used by the graphics options. The default cutoff score is 75%. The default line length is 50 characters. For protein sequences the cutoff is always 100%. The text window used by the graphics options controls the amount of sequence listed at the crosshair position. The graphics window controls the "zoom" function. Both these windows are defined as the number of bases that should be shown, to both left and right of the crosshair. @30. TX 3 @ Auto edit a contig This function automatically changes characters in gel readings to make them agree with the consensus sequence. If employed as is intended, use of this function is not a criminal activity but a method that saves a large amount of work. All characters changed by the auto editor will appear in the gel readings as lowercase letters. The current consensus calculation cutoff score is used. Identify the contig and the section to edit. The program will display a summary of changes made. Note that it is important to understand both what the auto editor does and the order in which it does it. Before employing the auto editor users should note all the corrections that they require, so that after it has been used the corrections can be checked. 
The general strategy employed when collecting shotgun sequence data is to let the contigs get fairly deep, to get a printout of a contig, check problems against the films, note corrections on the printout, and make the changes using an interactive editor. In general the consensus is correct except for places where padding characters have been used to accommodate a single gel with an extra character, or where the consensus is dash. The important point for the auto editor is that most edits simply make the gel readings conform to the consensus, or remove columns of pads.

The new editor does the following.

1) calculates a consensus for the contig (or part of a contig) to be edited, and then uses this consensus to direct the editing of the contig in 3 stages

2) stage 1: find and correct all places where, if the order of two adjacent characters is swapped, they will both agree with the consensus (given that they did not match the consensus before). These corrections are termed "transpositions".

3) stage 2: find and correct all places where there is a definite consensus but the gel reading has a different character. These corrections are termed "changes".

4) stage 3: delete all positions in which padding is the consensus. These corrections are termed "deletions".

All changed characters are shown in lowercase letters so it will be obvious which characters have been assigned by the program (except for deletions). The number of each type of correction will be displayed.
@10. TX 2 @Clear graphics
Clears graphics from the screen.
@11. TX 2 @Clear text
Clears text from the screen.
@12. TX 2 @Draw a ruler.
This option allows the user to draw a ruler or scale along the x axis of the screen to help identify the coordinates of points of interest.
The user can define the position of the first base to be marked (for example if the active region is 1501 to 8000, the user might wish to mark every 1000th base starting at either 1501 or 2000; it depends on whether the user wishes to treat the active region as an independent unit with its own numbering starting at its left edge, or as part of the whole sequence). The user can also define the separation of the ticks on the scale and their height. If required the labelling routine can be used to add numbers to the ticks.
@14. TX 2 @Reposition plots
The position of each plot is defined relative to a user's drawing board which has size 1-10,000 in x and 1-10,000 in y. Plots for each option are drawn in a window defined by x0,y0 and xlength,ylength, where x0,y0 is the position of the bottom left hand corner of the window, xlength is the width of the window and ylength is the height of the window.

---------------------------------------------------------  10,000
|                                                       |
|        --------------------------------------   ^     |
|        |                                    |         |
|        |                                    |         |
|        |                                    |ylength  |
|        |                                    |         |
|        |                                    |         |
|        --------------------------------------   v     |
|   x0,y0^                                              |
|        <---------------xlength-------------->         |
---------------------------------------------------------  1
1                                                  10,000

All values are in drawing board units (i.e. 1-10,000, 1-10,000). The default window positions are read from a file "ANALMARG" when the program is started. Users can have their own file if required. As all the plots start at the same position in x and have the same width, x0 and xlength are the same for all options. Generally users will only want to change the start level of the window y0 and its height ylength.

This option allows users to change window positions whilst running the program. The routine prompts first for the number of the option that the user wishes to reposition; then for the y start and height; then for the x start and length. Note that changes to the x values affect all options.
If the user types only carriage return for any value it will remain unchanged. Note that, unlike in all the other programs, the boxes used to contain analytical results (e.g. plot quality) should not be made to overlap one another, as the function of the crosshair routine depends on which box the crosshair is in!
@15. TX 2 @Label a diagram
This routine allows users to label any diagrams they have produced. They are asked to type in a label. When the user types carriage return to finish typing the label the cross-hair appears on the screen. The user can position it anywhere on the screen. If the user types R (for right justify) the label will be written on the diagram with its right end at the cross-hair position. If the user types L (for left justify) the label will be written on the diagram with its left end at the cross-hair position. The cross-hair will then immediately reappear. The user may then put the same label on another part of the diagram in the same way or, by hitting the space bar, be asked whether another label is to be typed in. Typical dialogue follows.

 ? Menu or option number=15
 Type label then drive cross hair to left or right end of label position
 then hit "L" to write label left justified or
 "R" to write label right justified
 or the space bar to quit
 ? Label=delta gene

 missing graphics

 ? Label=
@16. TX 2 @Display a map.
This draws a map of any sequence features selected by the user. These features may be protein coding regions (CDS), tRNA genes (TRNA), promoter positions (PRM), etc. Users may define their own feature table key names. For example I find it convenient to split CDS lines into CDS1, CDS2 and CDS3, each of which contains only those sequences that code in the reading frames 1, 2 or 3. Then I can plot them at different heights on the screen (suitable heights can be determined by using the cross-hair). The coordinates must be stored in a file in the format of an EMBL feature table. Typical dialogue follows.

 ?
Menu or option number=16
 Display a map using an EMBL feature table file
 ? map file name=hsegl1.ft
 ? feature code(e.g. CDS) =CDS
 X 1 + strand
   2 - strand
   3 both strands
 ? 0,1,2,3 =
 ? level (0-9480) (256) =4000

 missing graphics

 ? feature code(e.g. CDS) =
@7. TX 1 @Redirect output
Used to direct output that would normally appear on the screen to a file. Select redirection of either text or graphics, and supply the name of the file that the output should be written to. The results from the next options selected will not appear on the screen but will be written to the file. When option 7 is selected again the file will be closed and output will again appear on the screen.
@13. TX 2 @Use crosshair
This option puts a steerable cross on the screen which the user drives around by using the arrow keys (or mouse). When the crosshair is visible a number of options are available if the user types one of a set of special keyboard characters. Any other characters will cause an exit from the crosshair option. The special keys are:

 I = Identify the nearest gel reading
 Z = Zoom in
 Q = plot Quality
 S = display the aligned Sequences at the crosshair position
 N = list the Names and Numbers of the sequences at the crosshair

In order for any of these special keys to operate, the crosshair must be in an appropriate display box, and the precise function of the keys will also depend on which box the crosshair is in.

If the crosshair is in the "plot all contigs" box, Z will cause a new box to appear showing all the readings for the nearest contig; Q will give the same as Z but will also produce an extra box showing the "quality" plot.

If Z is hit in the "plot single contig" box, the contig will be zoomed to the current graphics window size. The zoom will be roughly centred on the crosshair position. Because of this it is possible to step along a contig by repeatedly zooming with the crosshair near to one end of the single contig display box.
If I is hit, the crosshair must be close to a gel reading line. If Q is hit, the quality plot will be produced for the region shown in the plot single contig box. In all cases when the "plot all contigs" box is shown, a vertical line will bisect the line that represents the relevant contig, at the current position.

If the crosshair is in the plot quality box only the character "s" will operate as a special symbol.

The number of bases shown in the N and S options is controlled by the current graphics text window size, and the size of the zoom window by the current graphics window size. Both are set by the parameter setting function of the general menu.
@33. TX 2 @Plot single contig
This option produces a schematic of a selected region of a single contig by drawing a horizontal line to represent each of its gel readings. The lines show the relative positions of each reading and also their sense. The plot is divided vertically into two sections by a line that is identified by an asterisk drawn at each end. All lines that lie above this line represent readings that are in their original sense; all lines below show readings that are in the complementary sense to their original.

By use of the crosshair function the plot can be stepped through and examined in more detail. See help on crosshair.
@34. TX 2 @Plot all contigs
This option produces a schematic of all the contigs in a database. It does this by drawing a horizontal line to represent each of them. In order to show the ends of each contig it draws the lines for contigs at alternate heights: the first at height one, the second at height two, the third at height one, etc. The order of the contigs in the display is the same as their order in the database.

By use of the crosshair function the plot can be stepped through and examined in more detail. See help on crosshair.
@31. TX 3 @ Type in gel readings
This option allows gel readings to be typed in at the keyboard.
It creates a separate file for each gel reading and a file of file names for the batch. The sequences from each batch may be listed when they have all been entered. Users may choose to employ special keys to identify the 4 bases A, C, G and T. By default these special keys are N M , . but any other four characters may be used. If special keys are used the characters are automatically translated to A C G T before being stored on the disk.
@35. TX 1 @ Find internal joins
The purpose of this function is to use data already in the database to find possible joins between contigs. Joins may have been missed due to poor data or may not have been made due to repeated sequences. Where appropriate, it may be possible to find potential joins by using the data clipped off readings prior to their entry into the database.

The database is checked for logical consistency. Supply a minimum initial match length, a minimum alignment block, the maximum pads per sequence, the maximum percent mismatch after alignment, and the probe length. Choose whether clipped data is to be used; if so, define the window size for finding good data and the number of dashes allowed in the window. Processing will commence.

Most of these values are used in an identical way in the autoassemble function. The others are defined below.

The program strategy

Take the first contig and calculate its consensus. If clipped data is being used, examine all readings that are in the complementary orientation, and sufficiently near to the contig's left end, to see if they have good clipped sequence which, if present, would protrude from the left end of the contig. If any are found, add the longest such sequence to the left end of the consensus. Do the same for the right end by examining readings that are in their original orientation. If any are found, add the longest extension to the right end of the consensus. Repeat the consensus calculations and extensions for all contigs, thereby producing an extended consensus.
If clipped data is not being used simply calculate the consensus for the whole database.

Now look for possible joins by processing the extended consensus in the following way. Take the last, say 100, bases (termed the "probe length" by the program) of the rightmost consensus and compare it, in both orientations, with the extended consensus of all the other contigs. Display any sufficiently good alignments. Repeat with the left end of the rightmost contig. Do the same for the ends of all the extended contigs, always comparing only with the contigs to their left, so that the same matches do not appear twice.

Good clipped data is defined by sliding a window of "Window size for good data scan" bases outwards along the sequence and stopping when "Maximum number of dashes in scan window" or more dashes appear in the window. Note that it is advisable to have some sort of cutoff, because if we simply take all the data it might be so full of rubbish that we won't find any good matches. For the same reason it is worth trying the procedure with different cutoffs. An initial run using no clipped data is also recommended.

Sufficiently good alignments are defined by criteria equivalent to those used in autoassemble; however, here we only display alignments that pass all tests.

Bugs

If a small contig is wholly contained within a larger one, such that its ends are further than ("Probe length" - "Minimum initial match length") from the ends of the larger contig, and the consensus for the small contig lies to the left of the consensus for the large contig, the overlap will not be discovered. (See the search strategy.)

All numbering is relative to base number one in the contig: matches off the left end of the contig (i.e. in the clipped data) have negative positions, and matches off the right end of the contig (i.e. in the clipped data) have positions greater than the contig length. A typical result is shown below.
 Right end of contig 22 in the - sense and contig 96
 Percentage mismatch after alignment = 3.0
        628        638        648        658        668        678
        GTGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
         ********* ********** ********** ********** ********** **********
        -TGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
        -86        -76        -66        -56        -46        -36
        688        698        708        718
        ACATCCGGGA TACAGTCAAT AAATGAAAAA TTAATGAATT
        ********** ********** ****** *** ***** ****
        ACATCCGGGA TACAGTCAAT AAATGA-AAA TTAATTAATT
        -26        -16        -6         4
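
The definition of good clipped data given above (a window slid outwards until it holds too many dashes) can be sketched as follows. This is only a minimal Python illustration and not the program's own code: the function name good_clipped_length, its argument names, and the choice to end the good data at the start of the offending window are assumptions made for the example.

```python
def good_clipped_length(clipped: str, window_size: int, max_dashes: int) -> int:
    """Return how many leading characters of `clipped` are treated as good data.

    `clipped` is assumed to be ordered outwards, i.e. its first character is
    the clipped base nearest to the data already used in the contig.
    `window_size` plays the role of "Window size for good data scan" and
    `max_dashes` that of "Maximum number of dashes in scan window".
    """
    if window_size <= 0 or max_dashes <= 0:
        raise ValueError("window_size and max_dashes must be positive")

    # Slide the window outwards one base at a time.
    for start in range(len(clipped) - window_size + 1):
        window = clipped[start:start + window_size]
        # Stop as soon as the window holds max_dashes or more dashes.
        if window.count("-") >= max_dashes:
            # Assumption: the good data ends where this window begins.
            return start

    # No window was rejected (or the data is shorter than one window):
    # assume all of the clipped data is usable.
    return len(clipped)


if __name__ == "__main__":
    example = "ACGTAC--GTAC"
    keep = good_clipped_length(example, window_size=4, max_dashes=2)
    print(example[:keep])
```

Running the scan with several values of the two parameters, as the text recommends, amounts to calling the function repeatedly and comparing how much clipped sequence is kept each time.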