staden-lg/help/sap_help

 @-1. TX  0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. TX  0 @Modification

 @0.  TX -1 @SAP

        This is an  interactive  program  whose  primary  use  is  for
  managing  shotgun  sequencing  projects, but it can also be used for
  handling alignments of other sequences, including those of proteins.
  Currently   the   maximum  'gel  reading'  length  is  set  to  4096
  characters. Almost all of the information below describes the use of
  the  program  for shotgun projects, but those using the programs for
  handling other sequence alignments should interpret it  accordingly.
  The data for such a project is stored in a special type of database.
  The program contains the tools that are  required  to  type  in  gel
  readings,  screen  them  against  vector  sequences  and restriction
  sites; enter new  gel  readings  into  the  database  (automatically
  comparing  and  aligning  them). In addition it contains editors and
  functions to examine the quality of the aligned sequences.

        There  are  three  main  menus:  "general",   "graphics"   and
  "modification", and some functions have submenus.
    The general menu contains the following options:

         0 = List of menus
         ? = Help
         ! = Quit
         3 = Open a database
         4 = Edit contig
         5 = Display a contig
         6 = List a text file
         7 = Direct output to disk
         8 = Calculate a consensus
        17 = Screen against restriction enzymes
        18 = Screen against vector
        19 = Check consistency
        25 = Show relationships
        27 = set parameters
        28 = Highlight disagreements
        29 = Examine quality

  The graphics menu contains:

         0 = List of menus
         ? = Help
         ! = Quit
        10 = Clear graphics
        11 = Clear text
        12 = Draw ruler
        13 = Use cross hair
        14 = Change margins
        15 = Label diagram
        16 = Plot map
        33 = Plot single contig
        34 = Plot all contigs


  The modification menu contains:

         0 = List of menus
         ? = Help
         ! = Quit
         4 = Edit a contig
         9 = Screen edit
        20 = Auto assemble
        21 = Enter new gel reading
        22 = Join contigs
        23 = Complement a contig
        24 = Copy database
        26 = Alter relationships
        30 = Auto edit a contig
        31 = Type in gel readings
        32 = Extract gel readings

    The enter new gel reading menu contains:

         ? = Help
         ! = Quit
         3 = Complete entry
         4 = Edit contig...
         5 = Display overlap
         6 = Edit new gel reading...

     The join contig menu contains:

         ? = Help
         ! = Quit
         3 = Complete join
         4 = Edit left contig...
         5 = Display joint
         6 = Edit right contig...
         7 = Move join

     The alter relationships menu contains:

         ? = Help
         ! = Quit
         3 = Line change
         4 = Edit single gel reading...
         5 = Delete contig
         6 = Shift
         7 = Move gel reading
         8 = Rename gel reading
         9 = Break contig

     The edit menu contains:

         ? = Help
         ! = Quit
         3 = Insert
         4 = Delete
         5 = Change


        Overview of the methodology

        The shotgun sequencing strategy

        In  the  shotgun  sequencing  procedure  the  sequence  to  be
  determined   is   randomly   broken  into  fragments  of  about  400
  nucleotides in length. These fragments are cloned and then  selected
  randomly  and  their  sequences    determined.     The  relationship
  between  any  pair  of fragments is  not  known  beforehand  but  is
  found  by  comparing  their  sequences.  If  the sequence of one is
  found to be wholly or partially contained  within  that  of  another
  for  sufficient  length  to  distinguish  an overlap  from  a repeat
  then those two fragments can  be  joined.  The  process  of  select,
  sequence  and  compare is continued until the whole of  the  DNA  to
  be  sequenced is in one continuous well determined piece.

        Definition of a contig

        A CONTIG is a set of gel  readings   that   are   related   to
  one  another   by   overlap  of  their  sequences.  All gel readings
  belong to a contig and each contig  contains  at   least   one   gel
  reading.   The  gel  readings in a contig can be summed to produce a
  continuous consensus sequence and the length of this sequence is the
  length  of the contig.  The rules used to perform this summation are
  given  under  "the  consensus  algorithm".   At  any  stage  of    a
  sequencing  project the data will comprise a number of contigs; when
  a  project  is complete  there  should be only one  contig  and  its
  consensus  will  be  the  finished  sequence.  Note that since being
  introduced and defined as above the word "contig" has been taken  up
  by  those involved in genomic mapping. In that context the consensus
  with a  precise length is not defined.

  Introduction to the computer method

        It is useful  to  consider  the  objectives  of  a  sequencing
  project  before  outlining  how  we use the computer to help achieve
  them. The aim of a shotgun  sequencing  project  is  to  produce  an
  accurate  consensus sequence from many overlapping gel readings.  It
  is necessary to know, particularly  at  the  latter  stages  of  the
  project,  how accurate the consensus sequence is. This enables us to
  know which regions of the sequence require further work and also  to
  know  when  the  project  is  finished.   To show the quality of the
  consensus, the programs described here produce  displays  like  that
  shown below.


                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

        This is an example showing the left  end  of  a  contig   from
  position   1  to  200.   Overlapping  this  region  are gel readings
  numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in  reverse  orientation
  to  their  original  reading  (denoted  by  a  minus sign). Each gel
  reading also has a name (eg HINW.010). It can  be  seen  that  in  a
  number  of  places the sequences contain characters other than A,C,G
  and T. Some  of  these  extra  characters  have  been  used  by  the
  sequencer   to  indicate  regions  of  uncertainty  in  the  initial
  interpretation of the gel reading, but the asterisks (*)  have  been
  inserted  by  the  automatic assembly function in order to align the
  sequences.  Underneath  each  50  character  block  of  gel  reading
  sequences  is the consensus derived from the sequences aligned above
  (the line labelled CONSENSUS). For most of its length the  consensus
  has a definite nucleotide assignment but in a few positions there is
  insufficient agreement between the gel readings and so  a  dash  (-)
  appears  in  the  sequence.  This  display contains all the evidence
  needed to assess the quality of the consensus: the number  of  times
  the  sequence has been determined on each strand of the DNA, and the
  individual nucleotide assignments given for each gel reading.
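
        As a rough illustration of how a dash can arise  in  the  line
  labelled  CONSENSUS,  the Python sketch below takes a simple vote in
  each column of a block of aligned gel readings.  The  function  name
  and  the  75%  cutoff  are  assumptions made for this example only;
  the rules the program really uses are  those  given  under  "the
  consensus  algorithm", and they may treat uncertainty codes
  differently.

```python
from collections import Counter


def consensus(rows, cutoff=0.75):
    """Column-by-column vote over aligned rows (hypothetical helper).

    rows   -- aligned gel reading strings; a space means "no data here"
    cutoff -- assumed fraction of a column that must agree
    """
    width = max(len(row) for row in rows)
    out = []
    for i in range(width):
        # Collect the characters present in this column.
        column = [row[i] for row in rows if i < len(row) and row[i] != " "]
        if not column:
            out.append(" ")
            continue
        symbol, count = Counter(column).most_common(1)[0]
        # A dash marks a column with insufficient agreement.
        out.append(symbol if count / len(column) >= cutoff else "-")
    return "".join(out)


# The last ten columns of positions 51 to 100 in the display above.
print(consensus(["CGGACACGTC", "GGCACA*GTC"]))  # -G-ACA-GTC
```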

        So the aim is to produce the consensus sequence  and,  equally
  important,  a  display of the experimental results from which it was
  derived.

        In order to achieve this the following operations need  to  be
  performed:
  1) Interpret autoradiographs and put individual  gel  readings  into
  the computer.
  2) Check each gel reading to make sure it is not simply part of  one
  of the vectors used to clone the sequence.
  3) Check each gel reading to make sure  that  those  fragments  that
  span  the  ligation point used prior to sonication are not assembled
  as single sequences.
  4) Compare all the  remaining  gel  readings  with  one  another  to
  assemble them to produce the consensus sequence.
  5) Check the quality of the consensus and edit the sequences.
  6) When all the consensus is sufficiently well determined, produce a
  copy of it for processing by other analysis programs.

        It is very unlikely that this procedure will  only  be  passed
  through  once.   Usually steps 1 to 5 are cycled through repeatedly,
  with step 4 just adding new sequences to  those  already  assembled.
  Generally step 6 is also used in order to analyse imperfect sequence
  to check if it is the one the project intended to  sequence,  or  to
  look  for  interesting  features. Analysis of the consensus, such as
  searches for protein coding regions, can also help to find errors in
  the  sequence.  The  display  of  the overlapping gel readings shown
  above can be used  to  indicate,  not  only  the  poorly  determined
  regions,  but  also  which  clones  should be resequenced to resolve
  ambiguities, or those which can usefully be extended or sequenced in
  the reverse direction, to cover difficult regions.

        The original individual gel readings for a sequencing  project
  are  each  stored in separate files. As the gel readings are entered
  into the computer (usually in batches, say 10 from a film), the file
  names  they are given are stored in a further file, called a file of
  file names. Files of file names enable gel readings to be  processed
  in batches.

        For each sequencing project we start a project database.  This
  database  has  a  structure  specifically  designed for dealing with
  shotgun sequence data. In order to arrive  at  the  final  consensus
  sequence  many  operations  will  be performed on the sequence data.
  Individual fragments must be sequenced and compared in  both  senses
  (i.e.  both  orientations)  with  all  the  other sequences. When an
  overlap between a new gel reading and a contig is found  they  must
  be aligned and the new gel reading added to the contig. If a new gel
  reading overlaps two contigs they must be aligned and joined. Before
  the  two contigs are joined one of them may need to be turned around
  (reversed  and  complemented)  so  they  are  both   in   the   same
  orientation.
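
        "Turning around" a sequence means reversing it  and  replacing
  each  base  by  its  complement.  A minimal Python sketch follows.
  The function name is hypothetical, and leaving padding and
  uncertainty  characters unchanged is an assumption of the sketch,
  not a statement of what the program does with them.

```python
# Complement table for the four bases; every other character
# (padding '*', dashes, uncertainty codes) passes through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the sequence reversed and complemented."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GGCACA*GTC"))  # GAC*TGTGCC
```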

        Clearly, keeping track of all  these  manipulations  is  quite
  complicated,  and  to  be  able  to  perform  the operations quickly
  requires careful choice of data structure and algorithms. For  these
  reasons  it  is not practicable to store the gel readings aligned as
  shown in the display above. Rather, it is more convenient  to  store
  the  sequences unassembled, and to record sufficient information for
  programs to assemble  them  during  processing.  The  data  used  to
  assemble the sequences is called relational information.

        The database comprises three  files  and  they  are  described
  under the section entitled "open database".

        Before entry into the project database each  new  gel  reading
  must  be  compared  to  look  for overlaps with all the data already
  contained within the database. This last  point  is  important:  all
  searching  for  overlaps  is between individual new gel readings and
  the data already in the database. There is no searching for overlaps
  between sequences within the database; overlaps must be found before
  new gel readings are entered into the database.

        Below I  give  an  introduction  to  how  the  sequences   are
  processed by being passed from one function to the next.

        This program is used to start a database for the  project  and
  then the following procedure is used.

        Data in the form of individual gel readings are  entered  into
  the  computer  and  stored  in  separate  files  using  either  this
  program or the digitizer program. Batches of these gel readings  are
  passed  to  the  screening  functions  in this program to search for
  overlaps with vector sequences  ("screen  against  vector")  or  for
  matches  to  restriction  enzyme  sites   that should not be present
  ("screen against enzymes"). Each run of  these  screening  functions
  passes  on  only  those  gel  readings  that do not contain unwanted
  sequences.  Sequences  are  passed  via  files  of  file  names  and
  eventually  are  processed by the automatic assembly function ("auto
  assemble"). This function compares each gel reading with a consensus
  of  all  the  previous  gel  readings stored in the database.  If it
  finds any overlaps it aligns the overlapping sequences by  inserting
  padding  characters,  and  then  adds  the  new  gel  reading to the
  database. Gels that overlap are added to existing contigs  and  gels
  that do not overlap any data in the database start new contigs. If a
  new gel overlaps two contigs they are joined. Any gel readings  that
  appear  to overlap but which cannot be aligned sufficiently well are
  not entered and have their names written to a  file  of  failed  gel
  reading names.

        Generally data is entered into the database in batches as just
  described.  The  program  is  also  used  to examine the data in the
  database, to enter gel readings that the automatic assembly function
  cannot  align  ("enter  new  gel reading"), and to make final edits.
  Edits to whole contigs can  be made in several  ways.  An  automatic
  editor  ("auto edit") will perform almost all edits without any user
  intervention, but the program also gives access to the system editor
  (EDT  on the VAX), through the function "screen edit", and to simple
  command driven editors ("edit contig" and "edit new  gel  reading").
  Disagreements  between  gel  readings in contigs and their consensus
  sequences can be highlighted  by  use  of  the  function  "highlight
  disagreements".

        Editing the  sequences  is  obviously  an  essential  part  of
  managing   a  sequencing  project.  Editing  is  required  when  new
  sequences are added, when contigs are joined, and when sequences are
  corrected.   A  basic part of the strategy used here is that new gel
  readings should be correctly aligned throughout their  whole  length
  when  they  are entered into the database, and that when contigs are
  joined they are edited so that they are well aligned in  the  region
  of  overlap.  Alignment can be achieved by adding padding characters
  to the sequences, and this is the way "auto assemble" operates  when
  adding new sequences to the database.

        In order to search for overlaps that may have been missed  due
  to  errors  in the gel readings, the function "extract gel readings"
  can be used to take copies of  the  gel  readings  at  the  ends  of
  contigs,  and  write  them out as separate files.  These can then be
  compared with the  database  consensus  using  the  "auto  assemble"
  function in a mode that forbids entry of data into the database, and
  any gel reading matching two contigs will indicate a join  that  has
  been  missed.  The  joins can then be made interactively using "join
  contigs". Missed matches can be found  at  this  stage  because  the
  errors in the sequences may have been corrected by new data.

        Generally the users need not concern themselves with  how  the
  relational  information  is used by the program, but it is necessary
  to know how contigs are identified. Because contigs  are  constantly
  being  changed  and  reordered  the  program  identifies them by the
  numbers of the gel readings they contain.  Whenever  users  need  to
  identify  a  contig they need only know the number or name of one of
  the gel readings it contains. Whenever the  program  asks  users  to
  identify  a  contig  or  gel reading they can type its number or its
  archive name. If they type its archive name they  must  precede  the
  name by a slash "/" symbol to denote that it is a name rather than a
  number. E.g. if the archive name is fred.gel with number  99,  users
  should  type  /fred.gel  or  99  when  asked to identify the contig.
  Generally, when it asks for the gel reading to  be  identified,  the
  program  will  offer  the user a default name, and if the user types
  only return, that contig will be accessed. When a database is opened
  the  default  contig  will  be  the  longest  one, but if another is
  accessed, it will subsequently  become the current default.
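
        The convention for identifying a contig or gel reading can  be
  sketched  in Python as follows. The function name and the shape of
  the returned values are assumptions made for  illustration;  only
  the  slash  rule and the use of a default on a bare return come
  from the description above.

```python
def parse_identifier(text, default=None):
    """Interpret what a user typed when asked for a contig or gel reading.

    "/fred.gel" -> ("name", "fred.gel")
    "99"        -> ("number", 99)
    ""          -> the default offered by the program, if there is one
    """
    text = text.strip()
    if not text:
        if default is None:
            raise ValueError("nothing typed and no default available")
        return default
    if text.startswith("/"):
        name = text[1:]
        if not name:
            raise ValueError("a name must follow the slash")
        return ("name", name)
    # Anything without a leading slash is taken to be a number.
    return ("number", int(text))


print(parse_identifier("/fred.gel"))  # ('name', 'fred.gel')
print(parse_identifier("99"))         # ('number', 99)
```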

        Further information is located in the  following  places.  The
  database  files  are described under "open database". The format for
  vector  and  consensus  sequences  is  given  under   "calculate   a
  consensus", as are the uncertainty codes used in gel readings.

        The only program, other than this, relevant to  sequencing  is
  the digitizer program  and it is outlined briefly below.

        The digitizer program is used for the  initial  input  of  gel
  readings  and  for  writing a file of file names. The program uses a
  digitizer for data entry. A digitizer is a two dimensional  surface,
  such  as  a light box, on which the coordinates of a special pen are
  recorded by a computer whenever the pen is pressed  onto  it.  These
  coordinates can be interpreted by a program.

        In order to read an autoradiograph placed on the light box the
  user  need  only  define the bottom of the four sequencing lanes and
  the bases to which they correspond and then use  the  pen  to  point
  to   each  successive   band  progressing  up  the gel.  The program
  examines the coordinates of each pen position to see in which of the
  four  lanes  it   lies  and  assigns  the  corresponding  base to be
  stored in the computer.  Each time the pen tip is depressed to point
  to  a  position on  the  surface of the digitizer the program sounds
  the bell on the terminal to indicate to the user that  a  point  has
  been recorded.  As the  sequence  is read the program displays it on
  the screen.
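
        The lane assignment can be pictured with the  following  small
  Python  sketch.  It  assumes  that  each  lane is described by the
  horizontal coordinate of its bottom and that a pen position belongs
  to  the  nearest lane; the names and the nearest-lane rule are
  illustrative assumptions, not a description of the digitizer
  program's code.

```python
def assign_base(x, lane_positions):
    """Return the base whose lane lies nearest to horizontal position x.

    lane_positions -- dict mapping base to the x coordinate of its lane,
                      as defined by the user at the bottom of the gel
    """
    return min(lane_positions, key=lambda base: abs(lane_positions[base] - x))


def read_bands(pen_points, lane_positions):
    """Turn successive pen positions (x, y), in the order they were
    pointed to going up the gel, into a sequence string."""
    return "".join(assign_base(x, lane_positions) for x, _y in pen_points)


lanes = {"A": 10.0, "C": 20.0, "G": 30.0, "T": 40.0}
points = [(11.0, 5.0), (29.0, 9.0), (41.0, 12.0), (19.0, 15.0)]
print(read_bands(points, lanes))  # AGTC
```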
 @17. TX 1 @Screen against restriction enzymes

        Used to compare gel readings against  any  restriction  enzyme
  recognition  sequences  that  may have been used  during cloning and
  which should not be  present  in  the  data.  Works  on  single  gel
  readings  or processes batches accessed through files of file names.
  The algorithm looks  for  exact  matches  to  recognition  sequences
  stored in a file.

        The  file  containing  the  recognition  sequences   must   be
  identified.  The  user  must choose between employing a file of file
  names, or typing in the names of individual gel reading files. If  a
  file  of  file names is used the program will also create a new file
  of file names. When the option has finished operating this new  file
  will  contain the names of all those gel readings that did not match
  any of the recognition sequences. Hence it can be used  for  further
  processing  of the batch. The recognition sequences should be stored
  in a simple text file with one recognition sequence per line.
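
        The effect of this screen on a batch can be sketched in Python
  as  below.  The  sketch works on sequences already held in memory
  rather than on files, and the function name, the folding to upper
  case  and  the search of one strand only are assumptions made for
  the example.

```python
def screen_against_enzymes(readings, recognition_sequences):
    """Split a batch into readings that pass and readings that fail.

    readings              -- dict mapping gel reading name to its sequence
    recognition_sequences -- recognition sequences, one per entry, as they
                             would appear one per line in the text file
    Returns (passed, failed) lists of names; `passed` corresponds to the
    new file of file names written for further processing.
    """
    sites = [s.strip().upper() for s in recognition_sequences if s.strip()]
    passed, failed = [], []
    for name, sequence in readings.items():
        sequence = sequence.upper()
        # An exact match to any recognition sequence rejects the reading.
        if any(site in sequence for site in sites):
            failed.append(name)
        else:
            passed.append(name)
    return passed, failed


batch = {"HINW.004": "TTTTCCAGCGTGCGTCTGAC", "HINW.005": "ACGGAATTCTTG"}
print(screen_against_enzymes(batch, ["GAATTC"]))
# (['HINW.004'], ['HINW.005'])
```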
 @18. TX 1 @Screen against vector

        Used to compare gel readings against any vector sequences that
  may have been picked up during cloning. Works on single gel readings
  or processes batches accessed  through  files  of  file  names.  The
  algorithm  looks  for exact matches of length "minimum match length"
  and displays the overlapping sequences.
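
        One straightforward way to find exact matches of a given
  minimum  length  is to index every word of that length in the
  vector and look up each word of the gel reading. The Python sketch
  below  shows  the  idea; it is not taken from the program, the
  function name is hypothetical, and it searches one strand only.

```python
def exact_matches(reading, vector, min_match_length):
    """Return (reading_offset, vector_offset) pairs, counted from zero,
    at which `min_match_length` consecutive characters are identical."""
    k = min_match_length
    # Index every word of length k in the vector by its start offset.
    index = {}
    for i in range(len(vector) - k + 1):
        index.setdefault(vector[i:i + k], []).append(i)
    hits = []
    for j in range(len(reading) - k + 1):
        for i in index.get(reading[j:j + k], ()):
            hits.append((j, i))
    return hits


print(exact_matches("TTACGTAC", "GGACGTACC", 5))  # [(2, 2), (3, 3)]
```

  A gel reading for which the list is empty would be passed on to the
  new file of file names.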

        The file containing the vector sequence  must  be  identified.
  The  user  must  choose  between  employing a file of file names, or
  typing in the names of individual gel reading files. If  a  file  of
  file  names  is used the program will also create a new file of file
  names. When the option has finished operating  this  new  file  will
  contain  the  names of all those gel readings that did not match the
  vector sequence. Hence it can be used for further processing of  the
  batch. The vector  sequence  should  be stored in a simple text file
  with up to 80 characters of data per line. More than one vector  can
  be  stored  in  a single file. If so each should be preceded by a 20
  character title of the form <---m13mp8.001-----> where the <  and  >
  signs  and  the  number like .001 are obligatory. The number must be
  preceded by a dot (.) and be 3 digits long. The  total  sequence  in
  the file must be < 50,001 characters in length.
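
        The layout rules for the vector file can be checked  with  a
  short Python sketch such as the one below. The function names, the
  regular expression and the label "untitled" for a single  vector
  stored  without  a title are assumptions of the sketch; the limits
  of 20, 80 and 50,000 characters are those stated above.

```python
import re

MAX_TOTAL_LENGTH = 50000   # total sequence must be < 50,001 characters
MAX_LINE_LENGTH = 80       # up to 80 characters of data per line
# "<", free text, a dot followed by exactly three digits, free text, ">"
_TITLE = re.compile(r"<[^<>]*\.\d{3}(?!\d)[^<>]*>")


def is_vector_title(line):
    """True for a 20 character title such as <---m13mp8.001----->."""
    return len(line) == 20 and _TITLE.fullmatch(line) is not None


def parse_vector_file(lines):
    """Return a dict mapping each title to its concatenated sequence."""
    vectors = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if is_vector_title(line):
            current = line
            vectors[current] = ""
            continue
        data = line.strip()
        if not data:
            continue
        if len(data) > MAX_LINE_LENGTH:
            raise ValueError("more than 80 characters of data on a line")
        if current is None:
            # A single vector may be stored without a title.
            current = "untitled"
            vectors[current] = ""
        vectors[current] += data
    if sum(len(s) for s in vectors.values()) > MAX_TOTAL_LENGTH:
        raise ValueError("total sequence exceeds 50,000 characters")
    return vectors
```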
 @20. TX 2 @Auto assemble

        Compares gel readings against  the  current  contents  of  the
  database  and  produces  alignments. In its normal mode of operation
  ("entry permitted"), the function will automatically enter  the  gel
  readings  into  the  database, but if entry is not permitted it will
  only  produce  alignments.  It  works  on  single  gel  readings  or
  processes  batches  of  gel  readings accessed through files of file
  names. It is the usual way to enter data into the database.

        The function will check the database for  logical  consistency
  and  will only proceed if it is OK. Choose if gel readings should be
  entered into the database, or  if  they  should  only  be  compared.
  Choose  between  using  a file of file names or typing file names on
  the keyboard. If so selected, supply the file of  file  names.  Also
  supply  a  file  of  file  names to contain the names of all the gel
  readings that fail to get entered. Select  the  entry  mode.  Normal
  assembly  is  appropriate  for  all but special cases, as is "permit
  joins". Uses for the other modes are not documented here.  Define  a
  minimum  initial match length. Define a minimum alignment block (the
  default value is taken in all but exceptional circumstances). Define
  the maximum number of padding characters allowed to be used in  each
  gel reading to help achieve alignment, and the same for  the  number
  allowed  in  the  contig  for  each  gel reading. Finally define the
  maximum percentage mismatch to be allowed for any gel reading to  be
  entered  into  the  database.  If for any gel reading, any of these
  last three values is exceeded the gel reading will  not  be  entered
  into the database.
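
        The final test applied to each aligned gel reading  can  be
  summarised  by  the Python sketch below. The class and function
  names are hypothetical; the default limits are the ones offered in
  the sample dialogue for this function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyLimits:
    """User-defined limits (defaults as offered in the dialogue)."""
    max_pads_per_gel: int = 8
    max_pads_per_gel_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel, pads_in_contig, percent_mismatch,
              limits=AssemblyLimits()):
    """True if none of the three limits is exceeded."""
    return (pads_in_gel <= limits.max_pads_per_gel
            and pads_in_contig <= limits.max_pads_per_gel_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)


# One pad in the gel, none in the contig, 1.8% mismatch.
print(may_enter(1, 0, 1.8))  # True
```

  A gel reading for which the result is false would have its  name
  written to the file of failed gel reading names.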

        In operation the  function  takes  a  batch  of  gel  readings
  (probably   passed  on   as   a  file  of file names from one of the
  screening routines) and enters them into a database for a sequencing
  project.  It takes each  gel reading in  turn, compares  it with the
  current consensus for the database and then produces  an   alignment
  for   any   regions   of   the   consensus   it overlaps;   if  this
  alignment is sufficiently good  it  then  edits  both  the  new  gel
  reading  and  the  sequences  it  overlaps   and   adds the new  gel
  reading to the database.  The program  then  updates  the  consensus
  accordingly and carries on to the next  gel  reading.

        All alignments are displayed and  any  gel  readings  that  do
  match but  that cannot be aligned sufficiently well have their names
  written to a file of failed gel reading names.  The  function  works
  without   any  user  intervention  and can process any number of gel
  readings in a single run.  Those  gel  readings  that  fail  can  be
  recompared  using  the  same  function  (to find the current overlap
  position) and  the user  can enter them into the  database  manually
  using  the   "enter new gel reading" option.

        Typical dialogue and output from the function is shown  below.
  (Note  that  output  for gel readings 2 - 9 has been deleted to save
  space).
  Automatic sequence assembler
  Database is logically consistent
  ? (y/n) (y) Permit entry
  ? (y/n) (y) Use file of file names
  ? File of gel reading names=demo.nam
  ? File for names of failures=demo.fail
  Select entry mode
  X  1 Perform normal shotgun assembly
     2 Put all sequences in one contig
     3 Put all sequences in new contigs
  ? Selection  (1-3) (1) =
  ? (y/n) (y) Permit joins
  ? Minimum initial match (12-4097) (15) =
  ? Minimum alignment block (2-5) (3) =
  ? Maximum pads per gel (0-25) (8) =
  ? Maximum pads per gel in contig (0-25) (8) =
  ? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing           1 in batch
    Gel reading name=HINW.004
    Gel reading length=   283
    Searching for overlaps
    Strand     1
    Strand     2
    No matches found
    Total matches found           1
    Padding in contig=    0 and in gel=    1
    Percentage mismatch after alignment =  1.8
    Best alignment found
           1         11         21         31         41         51
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           ********** ********** ********** ********** ********** **********
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           1         11         21         31         41         51
          61         71         81         91        101        111
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
           ********** ********** ********** ********** ********** ***** ****
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
          61         71         81         91        101        111
         121        131        141        151        161        171
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
           ********** ********** ********** ********** ********** **********
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
         121        131        141        151        161        171
         181        191        201        211        221        231
           GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
           ********** ********** * ******** ********** ********** **********
           GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
         181        191        201        211        221        231
         241        251        261        271        281
           CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
           ****  **** ********** ** * ***** ********** ****
           CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
         241        251        261        271        281
            Batch finished
            9 sequences processed
            0 sequences entered into database
            0 joins made


        Note that "auto assemble" cannot align protein sequences.
 @28. TX 1 @Highlight disagreements

        Used  in  the  latter  stages  of  a  project   to   highlight
  disagreements  between  individual  gel readings and their consensus
  sequences. Characters that agree with the consensus are shown  as  :
  symbols  for  the plus strand and . for the minus strand. Characters
  that disagree with the consensus are left unchanged and so stand out
  clearly. The results of this analysis are written to a file.
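
        The substitution can be sketched in Python as follows.  The
  function  name  is hypothetical, and treating a padding character
  that matches the consensus as "agreeing" is an assumption  of  the
  sketch.

```python
def highlight(reading, consensus, minus_strand=False,
              plus_symbol=":", minus_symbol="."):
    """Replace characters that agree with the consensus by a symbol.

    reading and consensus are aligned strings of equal length; spaces in
    the reading (no data at that position) are kept as spaces.
    """
    symbol = minus_symbol if minus_strand else plus_symbol
    out = []
    for r, c in zip(reading, consensus):
        if r == " ":
            out.append(" ")
        elif r == c:
            out.append(symbol)
        else:
            # Disagreements are left unchanged so that they stand out.
            out.append(r)
    return "".join(out)


print(highlight("GCTATTTTGT", "G-TATTTTGT"))        # :C::::::::
print(highlight("GTAT", "GTAT", minus_strand=True))  # ....
```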

        Before selecting this option create a file of the  display  of
  the  contig to be "highlighted". The option will ask for the name of
  this file. Select symbols to denote "agreeing"  characters  on  each
  strand; the defaults are : and ., but any others can be used. Supply
  the name of a file in which to put the output.

        The display file needed as input for this option is created by
  selecting  "Redirect  output",   followed  immediately  by  "display
  contig", and then "Redirect output" again. The cutoff score used  in
  the  consensus  calculation  can  be  set  by  option  "set  display
  parameters". Note that for the highlight function there is  a  limit
  of  50  for  the  number  of  gel  readings  that are aligned at any
  position - ie the contig must be less than 51 gel readings  deep  at
  its  thickest point. I hope that those performing shotgun sequencing
  never reach this limit, but those using the  program  for  comparing
  sequence families might.

        Typical output from this function is shown below.

                            210       220       230       240       250
      1  HINW.004    :C::::::::::::::::::::::::::::::::::::::::::AC::::
      7  HINW.018    :*::::::::::::::::::::::::::::::::::::::::::CA::::
     -4  HINW.017                                 ...............AC....
                     G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG

                            260       270       280       290       300
      1  HINW.004    ::::::::::::*:D:::::::::::::::::::
      7  HINW.018    ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
     -4  HINW.017    ..............................................A...
      3  HINW.009    :::::::::::::::V::::::::::::::::::::::::::::*AV:::
     -6  HINW.028                            ......................A...
                     AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG

 @32. TX 3 @Extract gel readings

        Used to make copies of the aligned gel readings in a database,
  to write them into separate files, and to write a corresponding file
  of file names. It operates in two modes: either all gel readings are
  extracted, or only those at the ends of contigs.

        Choose which mode of operation is required and supply  a  file
  of file names.

        The gel readings are given their original names.  If  used  to
  extract  the  gel  readings from the ends of contigs the function is
  useful for checking for missed contig joins: the file of file  names
  can  be  used with the auto assemble function to recompare these gel
  readings, and each should only overlap one contig. Any that  overlap
  two contigs will identify possible joins.

        If the option is used to extract all the gel readings  from  a
  database,  a  subsequent  run  of "auto assemble" can reconstitute a
  database which has  been  corrupted.  This   rarely  occurs  and  is
  usually  necessitated  by  a  user  employing  "alter relationships"
  incorrectly without first having made a copy.
 @1. TX 0 @Help

        Help is available on the following topics :
 @2. TX 0 @Quit

        This command stops the program and is the  only  safe  way  to
  terminate  a run of the program that has altered the contents of the
  database in any way.
 @3. TX 1 @Open a database

        Opens existing databases or allows new ones to be started. The
  function  is automatically called into operation when the program is
  started but can also be selected from the general menu.

        Choose to open an existing database or start a new one, or  if
  !  is  typed  when  the  program is first started, enter the program
  without opening a database. Supply a project database name,  and  if
  it  already exists, the "version". If starting a new database define
  the database size and if it is for DNA or  protein  sequences.   The
  database  size  is  an  initial  size  for  the  database. It can be
  increased later during the project. It is the sum of the  number  of
  gel readings plus the number of contigs.

        Database names can have from one to 12 letters  and  must  not
  include  full  stop  (.).  The  database is made from three separate
  files. If the database is called FRED then  version  0  of  database
  FRED comprises files FRED.AR0, FRED.RL0 and FRED.SQ0. The version is
  the last symbol in the file names.  Only this program can read these
  files. If the "copy database" option is used it will ask the user to
  define a new "version".
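
        The naming rule above can be sketched in a  few  lines.   This
  is  only an illustration: the function name database_files is made
  up for this example and is not part of the program. It assumes  the
  three  suffix stems AR, RL and SQ seen in the FRED example, and that
  the version is a single symbol.

```python
def database_files(name, version="0"):
    """Return the three file names that make up one version of a database.

    Hypothetical helper: it only encodes the naming rules stated in the text.
    """
    if not 1 <= len(name) <= 12:
        raise ValueError("database names have from one to 12 letters")
    if "." in name:
        raise ValueError("database names must not include a full stop")
    if len(version) != 1:
        raise ValueError("the version is the last (single) symbol of each name")
    # AR, RL and SQ are the stems used in the FRED example above.
    return ["{}.{}{}".format(name, stem, version) for stem in ("AR", "RL", "SQ")]


print(database_files("FRED", "0"))
```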

        For normal use the maximum gel reading length is  set  to  512
  characters,  but  when  a  database  is  started the user may choose
  lengths of either 512, 1024, 1536..., 4096. Normally the program  is
  used  to handle DNA sequences but many of the functions also work on
  protein sequences. The choice of sequence  type  is  made  when  the
  database is started.

        The contigs are not stored on the disk as the user  sees  them
  displayed  on the screen. Each gel reading is stored with sufficient
  information about how it overlaps other gel  readings  so  that  the
  program  can  work out how to present them aligned on the screen. We
  refer to this extra data as "the relationships" and it is  explained
  below.  The database comprises 3 separate files.
  1.  a working version of each gel reading.  This is the  version  of
  the   gel   reading  that  is in the database and initially it is an
  exact copy of the original sequence (known as the archive) but it is
  edited and manipulated to align  it with other gel readings.
  2.  the file of  relationships.   This  file  contains  all  of  the
  information  that  is required to assemble the working versions into
  contigs during processing;  any manipulations on the data  use  this
  file   and  it  is  automatically  updated  at  any  time  that  the
  relationships are changed.  The  information  in  this  file  is  as
  follows:
  (A) Facts about  each   gel  reading   and   its   relationship   to
  others ("gel descriptor lines"):
  (a) the number of the gel reading   (each gel reading   is  given  a
  number  as  it  is entered into the database)
  (b) the length of the sequence from this gel reading
  (c) the position of the left end of this gel reading    relative  to
  the left end of the contig of which it is a member
  (d) the number of the next gel reading   to the  left  of  this  gel
  reading
  (e) the number of the next gel reading   to the right
  (f) the relative strandedness of this gel reading  , ie whether   it
  is  in the same sense or the complementary sense as its archive.
  (B) Facts about each contig ("contig descriptor lines"):
  (a) the length of this contig
  (b) the number of the leftmost gel reading   of this contig
  (c) the number of the rightmost gel reading   of this contig.
  (C) General facts:
  (a) the number of gel readings in the database
  (b) the number of contigs in the database.
  3.  the file of archive names.  This is simply a list of  the  names
  of each of the archive files in the database but on line number 1000
  we also store the size of the database. ie the number  of  lines  of
  information allowed in the database files. This file always has 1000
  lines but the length of the file of relationships and  the  file  of
  working  versions can be set by the user when creating a database or
  when copying from one to another.

        Structure of the database files

        1.  The file of relationships

        The file contains IDBSIZ lines of data:  the general data  are
  stored  on line IDBSIZ;   data about  gel readings  are stored  from
  line 1 downwards;  data about contigs are stored from line  IDBSIZ-1
  upwards.  A  database  of 500 lines containing 25 gel readings and 4
  contigs would have a file of relationships as is shown below.


                    ---------------------------------------------
                       1  Gel descriptor record
                       2   "      "       "
                       3   "      "       "
                       4   "      "       "
                       5   "      "       "
                       '   '      '       '
                       '   '      '       '
                      25   "      "       "
                      26  Empty record
                       '    '     '

                       '    '     '
                     495    '     '
                     496  Contig descriptor record
                     497    "        "        "
                     498    "        "        "
                     499    "        "        "
                     500   Number of gel readings=25, Number of contigs=4
                    ---------------------------------------------

            The arrangement of the data in the file of relationships

  As each new gel reading   is added into the database a new  line  is
  added to  the  end  of  the  list  of gel descriptor lines.  If this
  new gel  reading  does not overlap with any gel readings already  in
  the  database  a new contig  line  is added  to  the top of the list
  of contig lines.  If it overlaps with one contig then no new  contig
  line  need  be  added  but  if it  overlaps with  two  contigs  then
  these  two  contigs must be joined and the number  of  contig  lines
  will  be reduced by one. Then the list of contig lines is compressed
  to  leave  the empty line at the top of the list.  Initially the two
  types  of  line will move towards  one  another  but eventually,  as
  contigs  are joined, the contig descriptor lines will  move  in  the
  same  direction  as the  gel descriptor lines.   At  the  end  of  a
  project  there should  be only one contig  line.   The  database  is
  thus capable of handling a project of 998 gels.
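
        The arrangement described above can be  checked  with  a  small
  sketch.   The function name relationships_layout and the dictionary
  it returns are inventions for this illustration only; the line
  arithmetic follows the 500 line example given earlier.

```python
def relationships_layout(idbsiz, n_gels, n_contigs):
    """Return the line ranges used in a file of relationships.

    Illustrative only: gel lines grow down from line 1, contig lines sit
    directly above the general line, which is always line IDBSIZ.
    """
    if n_gels + n_contigs > idbsiz - 1:
        raise ValueError("database too small for this many gels and contigs")
    return {
        "gel": range(1, n_gels + 1),
        "empty": range(n_gels + 1, idbsiz - n_contigs),
        "contig": range(idbsiz - n_contigs, idbsiz),
        "general": idbsiz,
    }


layout = relationships_layout(500, 25, 4)
print(list(layout["contig"]))                    # contig descriptor lines
print(layout["empty"][0], layout["empty"][-1])   # first and last empty line
```

  With a 1000 line database the same check accepts 998 gel readings in
  a single contig and rejects 999.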

        Structure of the working versions file

        The working versions of gel readings are stored  in   a   file
  of  IDBSIZ lines each containing 512 characters.  Gel reading number
  1 is stored on line 1, gel reading number  2 on line 2 and so on.

        Structure of the archive names file

        This file, unlike the others, always has 1000  lines  each  10
  characters  in length. Its length is fixed because line 1000 is used
  to store IDBSIZ the database size and the programs need  a  definite
  location from which to read this number.

        Safeguarding the database

        It is advisable to copy regularly (using the copy function  of
  DS) from say copy 0 to copy 1 in case of errors.

        I also recommend setting the protection codes  on  copy  0  of
  each  database  so  that users cannot delete the files without first
  resetting  the  protection  codes.  This  will  protect   you   from
  accidentally deleting the  files.  Users  at LMB can use the PROTECT
  command for this purpose.

        The give-up options allow users to change  their  minds  about
  entering   a   new   gel   reading  or  joining  two contigs without
  affecting the file  of  relationships.   BUT  if  the   edit  contig
  option   from   either   of   these  two functions has been used the
  edits will remain even though the user has "given up".  To leave the
  files  completely unaffected  the  user  could,  if  required,  undo
  any edits before "giving up".

        There  are  various  checks  within  the  programs  to protect
  users from themselves:-
  1.  All user input is checked for  errors  -  e.g.    reference   to
  non-existent  gel readings or  contigs,  incorrect  positions in the
  contig or gel readings.
  2.  Before entering a gel reading the system checks to see if a file
  of the same name has already been entered.
  3.  Join will not allow the circularising of a contig.
  4.        Both enter and join  functions  restrict  the  region that
  the   user  is  allowed to edit (using edit contig) to the region of
  overlap.
  5. Users may escape from any point in the program.
  6. Help is available from all points in the program.


  IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
  ANYTHING THAT  INVOLVES  CHANGING THE CONTENTS OF THE DATABASE, I.E.
  DURING AUTO ASSEMBLE,  COMPLETE  ENTRY,  COMPLETE  JOIN,  COMPLEMENT
  CONTIG,  EDIT  CONTIG,  AND  SCREEN  EDIT.   This  could corrupt the
  database so badly that it is impossible to fix. The  program  should
  always be left using the QUIT option.
 @4. TX 3 @Edit

        A simple command driven editor that  can  insert,  delete  and
  change  gel  reading  sequences.  Insert, delete and change commands
  will request the position at which the  edit  is  required  and  the
  number  of  characters  to  insert,  delete  or  change. The default
  character for insertions is *.

        There are three  modes  of  editing  offered  by  this  editor
  depending where it is selected from.  New gel readings can be edited
  as they are being entered into the database, contigs can  be  edited
  with  alignments  being automatically maintained, or gel readings in
  contigs can be edited without the maintenance of alignments.
  The following commands can be used.

     ? = Help
     ! = Quit
     3 = Insert
     4 = Delete
     5 = Change


        All commands request the position at which the edit should  be
  made.   (Note that the position refers to the position in the contig
  for gel readings in the database, but to the  position  in  the  gel
  reading  if you are editing a new gel reading while entering it into
  the database.)

        All commands request the number of characters to  operate  on.
  (Note  that if you are editing a contig the program will ask for the
  characters to insert into each separate gel reading, hence  allowing
  different  changes to be made to each. Also the default character is
  asterisk (*) - i.e. if you include a space in the string it will  be
  replaced  by  an  asterisk,  or  if you simply type return the whole
  string inserted will be asterisks.)
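
        The default character rule can be pictured  as  follows.   The
  function  name  padded_insert is hypothetical, and the handling of a
  reply that is shorter or longer than the requested number of
  characters is an assumption, not documented behaviour.

```python
def padded_insert(typed, count, pad="*"):
    """Build the string to insert from what the user typed.

    Spaces become the padding character, and an empty reply gives a
    string made entirely of padding, as described in the text.
    """
    text = typed.replace(" ", pad)
    # Assumption: short replies are filled with padding, long ones truncated.
    return text[:count].ljust(count, pad)


print(padded_insert("", 4))      # user simply typed return
print(padded_insert("A C", 3))   # space replaced by the default character
```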
  "Change" allows  characters  in  individual  gel  readings   to   be
  replaced.   If  the  user  is  not  editing a new gel reading during
  "enter new gel reading" the program will request the number  of  the
  gel  reading  to  edit.   (When  editing gel readings in contigs the
  program responds with the relative  position  and   length   of  the
  selected   gel   reading   in  case  the  user  only  knows the edit
  position relative to the  gel reading.  The  edit   position    must
  be relative to the contig.)
  Further notes on editing

        When you are  editing  a  contig  the  program  maintains  the
  alignments  of  the gel readings by always making the same number of
  insertions or deletions  in all the gels.  Note that these edits are
  immediately  carried  out  and  the "Quit" options of "enter new gel
  reading" and "join contigs" do not undo them.  Users must undo  them
  themselves.  Note  that  if this option has been entered from either
  "enter new gel reading" or "join contigs" the program will  restrict
  edits   to the  region  of  overlap.  DO NOT KILL THE PROGRAM DURING
  EDIT CONTIG!

        When editing a single gel reading  in  a  contig  from  "alter
  relationships"  (which  you  should  not  normally  need  to do) the
  program will correct the length of the individual gel  reading,  but
  it will not update the length of the contig if it has changed.

        The program contains better methods than this  simple  command
  driven  editor, for making multiple edits to contigs. "Screen edit",
  gives access to the system editor on your machine, and  "auto  edit"
  will edit a whole contig automatically.
 @9. TX 3 @Screen edit

        Gives access to the system editor on the machine (for  example
  EDT  on  a  VAX)  and  allows users to edit contigs. The contigs are
  presented as for "display contig" and the program will  reconstitute
  the contig's sequences and relationships  when the editor is exited.

        To screen edit a contig set the line length to 50  characters,
  select  the  contig to edit, and supply the name of a temporary file
  in which the editing will be performed.  After  a  short  pause  the
  system editor will present the first page of the file. Edit the file
  obeying the rules given below. Exit from the editor and  affirm  the
  intention  of returning the contig to the database. The program will
  put the contig back into the database.

        Rules for screen editing

        There are some limitations on the changes that can be made  to
  the contigs when using the screen editor. Users are unlikely to want
  to break the rules in order  to  achieve  changes  to  contigs,  but
  nevertheless  the  constraints need to be defined and they are given
  below.

        Alignments must be maintained during editing.  Whole lines  of
  sequence  should not be deleted or added unless the order of the gel
  readings in the contig  is  preserved.   Each  line  in  the  contig
  display  consists  of  gel  reading  numbers,  their  names  and  50
  character sections  of  sequence.  Insertions  are  limited  in  the
  following  way.  No line of sequence can be extended rightwards more
  than 10 characters beyond the end of a  full  length  line  (a  full
  length  line is 50 characters long). Only one character can be added
  to the left end of full  length  lines,  but  sections  of  sequence
  beginning  further  into  a  line can be extended leftwards up to an
  equivalent position. Do not delete any  non-sequence  lines  in  the
  file.

        Before returning the contig to the database the program checks
  that  the rules have been obeyed. If an error is found the number of
  the erroneous line in the file is displayed and the contig will  not
  be changed.
 @5. TX 1 @Display a contig

        Used to show the aligned  gel  readings  for  any  part  of  a
  contig.  The  number,  name  and strandedness of each gel reading is
  shown and the consensus is written below.

        If required identify the contig,  and then the start  and  end
  points of the region to display.

        The display can be directed  to  a  disk  file  using  "direct
  output to disk".  These files are required by options: "screen edit"
  and "highlight disagreements", and printed copies of them  are  very
  useful for marking corrections prior to using the editors.

        Below is an example showing the left  end  of  a  contig  from
  position  1  to  200.  Overlapping this region are gels 6, 3, 5,  17
  and 12; 6, 3 and 5 are in reverse orientation  to   their   archives
  (denoted  by  a  minus  sign). There are a few uncertainty codes and
  a few padding characters in the working versions, but the  consensus
  (shown  below  each page width) has a definite assignment for almost
  every position.

                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
 @6. TX 1 @List a text file

        This option allows users to list text files on the screen.  It
  can  be  used  to  read  a file containing notes, for checking files
  written to disk etc. The user is asked to type the name of the  file
  to list.
 @8. TX 1 @Calculate a consensus

        Calculates  a  consensus  sequence   either  for   the   whole
  database or for selected contigs. The consensus is written to a file
  named by the user.
  Supply a file name,  choose  between   whole  database  or  selected
  contigs.

        Symbols for uncertainty in gel readings

        In  order  to  record  uncertainties  when  reading  gels  the
  codes  shown  below can  be  used. Use  of these codes permits us to
  extract the maximum amount of data from each gel and yet record  any
  doubts   by  choice   of   code.    The program can deal with all of
  these codes and any other  characters  in  a  sequence  are  treated
  as  dash  (-) characters.

         SYMBOL                  MEANING

           1             PROBABLY        C
           2                "            T
           3                "            A
           4                "            G
           D                "            C       POSSIBLY        CC
           V                "            T          "            TT
           B                "            A          "            AA
           H                "            G          "            GG
           K                "            C          "            C-
           L                "            T          "            T-
           M                "            A          "            A-
           N                "            G          "            G-
           R             A OR G
           Y             C OR T
           5             A OR C
           6             G OR T
           7             A OR T
           8             G OR C
           -             A OR G OR C OR T
           a             A set by auto edit
           c             C set by auto edit
           g             G set by auto edit
           t             T set by auto edit
           *             padding character placed by auto assembler
            else = -

  The DNA consensus algorithm

        The "calculate  consensus"  function,  the  "display   contig"
  routine  and  the  "examine quality" option use  the rules  outlined
  here
  to  calculate  a consensus  from aligned gel  readings.   Note  that
  "display  contig"  calculates a consensus for  each  page  width  it
  displays  (it  does  not use the consensus sequence file  calculated
  by the consensus function).

        We have 6 possible symbols in the consensus sequence: A,C,G,T,*
  and -. The last symbol is assigned if none of the others makes up  a
  sufficient proportion of the aligned characters at any  position  in
  the contig. The following calculation is used to decide which symbol
  to place in the consensus at each position.

        Each uncertainty code contributes a score to one of  A,C,G,T,*
  and  also  to  the  total  at each point. Symbols like R and Y which
  don't correspond to a single base type contribute only to the  total
  at each point. The scores are shown below.
                definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1

                probable assignments ie 1,2,3,4 = 0.75

                other uncertainty codes including R,Y,5,6,7,8,- = 0.1


        A cutoff score of 51% to 100% is supplied by the  user.  (When
  the   program   starts   this  is  set  to  75%.  See  "set  display
  parameters").  At each position in the contig we calculate the total
  score  for  each of the 5 symbols A,C,G,T and * (denote these by Xi,
  where i=A,C,G,T or *), and also the sum of these totals (denote this
  by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
  in the consensus; otherwise - is assigned.

        Notice that S does not equal the number of times the  sequence
  has  been  determined, but is the score total, and hence we are less
  likely to put a -  in  the  consensus.  For  the  "examine  quality"
  algorithm  each  strand is treated separately but the calculation is
  the same. (It was originally different).
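
        The calculation for a single position can be sketched as shown
  below.   This  is  an  illustration written from the rules above and
  is not the program's own code. The names DEFINITE, PROBABLE and
  consensus_symbol  are  made  up  for the example. It assumes that the
  0.1 scored by codes such as R and Y is included in S, and it applies
  the strict "greater than" test exactly as stated.

```python
# Codes scoring 1 for a single base or for padding (from the score list).
DEFINITE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "*": "*",
    "a": "A", "c": "C", "g": "G", "t": "T",
    "D": "C", "V": "T", "B": "A", "H": "G",
    "K": "C", "L": "T", "M": "A", "N": "G",
}
# Codes scoring 0.75 for a single base.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    scores = dict.fromkeys("ACGT*", 0.0)
    total = 0.0
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            # R, Y, 5, 6, 7, 8, - and anything else add only to the total.
            total += 0.1
    if total == 0.0:
        return "-"
    for symbol, score in scores.items():
        if 100.0 * score / total > cutoff:
            return symbol
    return "-"


print(consensus_symbol(["G", "G"]))   # agreement
print(consensus_symbol(["C", "G"]))   # no symbol reaches the cutoff
print(consensus_symbol(["3", "A"]))   # "probably A" plus A
```

  Because the cutoff is never below 51% at most one symbol  can  pass
  the test, so the order in which the symbols are examined does not
  matter.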

        Format of the consensus sequence ( and vector sequences).

        A consensus  sequence  file  may  contain  the  consensus  for
  several contigs and so we identify each of them by preceding them by
  a 20 character title. The title is of the form  <---LAMBDA.076----->
  (  where LAMBDA is the project name and gel reading number 76 is the
  leftmost gel reading to contribute to  this   consensus   sequence).
  The   angle  brackets  <>  and the three digit number preceded by  a
  full stop (.) are important to some processing programs.
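
        A program reading such a file could  recognise  the  titles  as
  sketched  below.   The  pattern and the name parse_title are only an
  illustration. They assume that the title is padded with dashes on
  both  sides  and  that  the project name itself contains no dashes,
  full stops or angle brackets.

```python
import re

# "<", dashes, project name, ".", three digits, dashes, ">"
TITLE = re.compile(r"^<-*([^.<>-]+)\.(\d{3})-*>$")


def parse_title(title):
    """Return (project name, leftmost gel reading number) from a title."""
    if len(title) != 20:
        raise ValueError("titles are 20 characters long")
    match = TITLE.match(title)
    if match is None:
        raise ValueError("not a consensus title: " + title)
    return match.group(1), int(match.group(2))


print(parse_title("<---LAMBDA.076----->"))
```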
 @25. TX 1 @Show relationships

        Used to show the relationships of  the  gel  readings  in  the
  database in three ways -
  (a) All contig descriptor lines  followed  by  all  gel   descriptor
  lines.
  (b) All contigs one after the   other   sorted,   i.e.    for   each
  contig   show  its   contig  descriptor line followed by all its gel
  descriptor lines sorted on position from left to right
  (c) Selected contigs:  show the contig  line  and,  in  order, those
  gel  readings  that  cover  a  user-defined  region.  Note that this
  output can be directed to a disk file by prior  selection  of  "disk
  output".

        Below is an example showing a contig from position 1  to  689.
  The left gel reading  is number 6 and has archive name HINW.010, the
  rightmost gel  reading is number 2 and has archive  name   HINW.004.
  On  each  gel  descriptor  line  is  shown:  the name of the archive
  version, the gel number, the position of the left  end  of  the  gel
  reading  relative to the left  end  of  the  contig,  the length  of
  the gel reading  (if this is negative it means that the gel  reading
  is  in  the  opposite orientation to its archive), the number of the
  gel reading   to the left and the number of the gel reading  to  the
  right.


   CONTIG LINES
   CONTIG      LINE  LENGTH               ENDS
                                       LEFT   RIGHT
                 48     689               6       2
   GEL LINES
   NAME      NUMBER POSITION LENGTH     NEIGHBOURS
                                       LEFT   RIGHT
   HINW.010       6        1   -279       0       3
   HINW.007       3       91   -265       6       5
   HINW.009       5      137   -299       3      17
   HINW.999      17      140    273       5      12
   HINW.017      12      193    265      17      18
   HINW.031      18      385   -245      12       2
   HINW.004       2      401   -289      18       0
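
        The gel lines above form a chain that can  be  followed  from
  the  left  end  of  the  contig.  The sketch below is an illustration
  only. The names Gel, GELS, walk_contig and contig_length are made up
  for  it,  and  it  takes  a  neighbour  number  of 0 to mean "no
  neighbour", as the example output suggests.

```python
from collections import namedtuple

Gel = namedtuple("Gel", "name number position length left right")

# The gel lines from the example above.
GELS = [
    Gel("HINW.010", 6, 1, -279, 0, 3),
    Gel("HINW.007", 3, 91, -265, 6, 5),
    Gel("HINW.009", 5, 137, -299, 3, 17),
    Gel("HINW.999", 17, 140, 273, 5, 12),
    Gel("HINW.017", 12, 193, 265, 17, 18),
    Gel("HINW.031", 18, 385, -245, 12, 2),
    Gel("HINW.004", 2, 401, -289, 18, 0),
]


def walk_contig(gels, left_end):
    """Follow right neighbours from the left end; return the gel numbers."""
    by_number = {gel.number: gel for gel in gels}
    order, seen = [], set()
    number = left_end
    while number != 0:
        if number in seen:
            raise ValueError("loop in contig at gel %d" % number)
        seen.add(number)
        order.append(number)
        number = by_number[number].right
    return order


def contig_length(gels):
    """Rightmost position covered; the sign of length is only strandedness."""
    return max(gel.position + abs(gel.length) - 1 for gel in gels)


print(walk_contig(GELS, 6))
print(contig_length(GELS))
```

  For the example data the chain visits gels 6, 3, 5, 17, 12, 18 and 2
  and  the  computed length agrees with the 689 shown on the contig
  line.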

 @21. TX 3 @Enter new gel reading

        Used to enter new gel readings into the database. The new  gel
  reading  must have previously been compared with the contents of the
  database by use of "auto assemble"  in  order  to  ascertain  if  it
  overlaps any previously entered data.

        The user is expected to know: if the gel reading overlaps;  if
  so  which  contig  it overlaps; if so where it overlaps. The program
  takes the user through a series of questions to establish the nature
  of  the  overlap  and  then  displays  the overlap. The user is then
  offered a number of options,  including  editors  for  the  new  gel
  reading  and  the contig, to enable the correct alignment of the gel
  reading throughout its whole length.
  Supply the name of the gel reading file.  If the  gel  reading   has
  been  entered before the program will  not permit entry. The program
  gives the gel reading a unique  number  and  asks  if  the  sequence
  overlaps  any  data  already  in  the  database  (reported  by "auto
  assemble").  If it does not, entry is complete.  If it does  overlap
  the  dialogue  continues with the program asking if the gel  reading
  overlaps "in  the  normal  sense",  if  not  it  will  automatically
  complement  the  sequence.  Then supply the number of the contig the
  gel reading overlaps (as reported by "auto assemble").

        Overlaps are divided into two types: those for which  the  new
  gel  reading  protrudes from the left end of the contig it overlaps,
  and those for which it does not. The program asks  about  this  with
  the  question "Left end of gel reading is inside contig". If this is
  true the program will go on to ask for the position in the contig of
  the  left  end of the new gel reading. If it is not true the program
  will ask for the position in the new gel reading of the left end  of
  the contig.

        Once this is completed the program will display the  first  50
  bases  of  the  overlap.  The  gel  readings in the contig and their
  consensus are displayed with the new  gel  reading  underneath.  The
  mismatches are shown by *'s on the next line down. For example:


                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC
         NEWGEL      CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC
         MISMATCH                  *                         * *
                             10        20        30        40        50


        The program then needs to know if the position  of  the   left
  end  of  the  overlap  is  correct.   If  it is the user should type
  return, if not, 1 and the program will ask for the new position  and
  display it.
  The program now offers a number of  options  to  allow  the user  to
  align  the  new gel reading correctly over its whole length with the
  data   already   in   the   contig.    It   is    important     that
  sufficient   edits   are   made   to  the  new  gel  reading  or the
  sequences in the contig at this stage to get the alignment  correct,
  because  once entry  is completed, the alignment is fixed and cannot
  easily be changed (see "alter relationships").  Alignment   can   be
  achieved  by   making  insertions   or  deletions  but  deletion  of
  data requires the original gels to be checked.   For   this   reason
  at  entry  we usually make only insertions to achieve alignment.  We
  use X or asterisks (*) as padding characters  to  achieve  alignment
  and  so   can,  if  required, distinguish  padding  characters  from
  characters assigned from reading gels.

        The options available are:
     ? = HELP
     ! = Give up
     3 = Complete entry
     4 = Edit contig
     5 = Display overlap
     6 = Edit new gel reading


        1. HELP gives this information.

        2. Give up allows users to change their minds  about  entering
  the  new  gel reading. The program will ask the user to confirm this
  choice.

        3. Complete entry is the command to add the new gel reading to
  the  contig.  The program updates the relationships accordingly. The
  user is asked to confirm this command.

        4. Edit contig gives the user access to a simple  editor  that
  allows  insertions,  deletions and changes to be made to the contig.
  The editor  maintains  alignments  by  making  the  same  number  of
  insertions or deletions in all sequences covering the edit position.
  The  program  protects  the  user  by  allowing  edits  only  within
  the region of overlap.

        5. Display allows display of the region of overlap only.  This
  is  defined  by the relative positions in the contig. The default is
  the whole of the region of overlap.

        6. Edit new gel reading allows  the  new  gel  reading  to  be
  edited using a simple editor.
 @23. TX 3 @Complement a contig

        This function will complement  and  reverse  all  of  the  gel
  readings   in    a  contig.     It    automatically   reverses   and
  complements  each  gel reading sequence,  reorders  left  and  right
  neighbours,   recalculates   relative  positions  and  changes  each
  strandedness.
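
        The effect on one gel descriptor line can be sketched as shown
  below.   It  is  an  illustration only and leaves out the sequences
  themselves. The name complement_gel is hypothetical, and the sketch
  assumes, as  in  "show  relationships",  that a negative length
  marks a gel reading held in the opposite sense to its archive.

```python
def complement_gel(gel, contig_length):
    """Return (position, length, left, right) after complementing a contig.

    The old right end of the gel reading becomes its new left end, the
    sign of the length flips and the two neighbours change places.
    """
    position, length, left, right = gel
    old_right_end = position + abs(length) - 1
    new_position = contig_length - old_right_end + 1
    return (new_position, -length, right, left)


# Gel 2 (HINW.004) from the "show relationships" example, contig length 689.
print(complement_gel((401, -289, 18, 0), 689))
```

  Applying the function twice gives back the original line, as would be
  expected of complementing a contig twice.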

        The only user  input  required  is  to  identify  the   contig
  to complement  by  the  number or name of a gel reading it contains.
  DO NOT KILL THE PROGRAM DURING THIS STEP!
 @22. TX 3 @Join contigs

        This function joins contigs interactively.  It allows the user
  to  align  the  ends  of  the  two  contigs  by  editing each contig
  separately.  It  is  important  that  the  alignment   achieved   is
  correct  because  once the join is completed the alignment is fixed.
  The program needs to know which two contigs to join and  where  they
  overlap.

        First the program needs to know which two contigs are  to  be
  joined. The user should identify the left contig and then the right.
  The program checks that the two contig  numbers  are  different  (it
  will not allow circles to be formed!)

        Now identify the exact position of overlap. This is defined as
  the  position  in the left contig that the leftmost character of the
  right contig overlaps.  Normally  the  position  is  established  by
  employing  the  end  gel  reading  for  option "auto assemble".  The
  overlap must be of at least  one  character.      The  program  then
  displays  the  join  showing  all  the  gel readings overlapping the
  join from the left contig, their consensus,  all  the  gel  readings
  from  the  right  contig  that  overlap  the  join,  their consensus
  and   then  asterisks  to  denote   mismatches   between   the   two
  consensuses. For example:

                           1460      1470      1480      1490      1500
     56  HINW.100    TCT*GAGCAGTGTGGGCGCTG*CCGG
     33  HINW.300    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
    -25  HINW.090    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
     19  HINW.123    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
     -6  HINW.010    TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
     -3  HINW.007                TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
     -5  HINW.009                              GCTCGGAGGGCATGAAGT*AGAGC
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
         MISMATCH                         *                      ******
                             10        20        30        40        50
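
        The MISMATCH line in the display above  can  be  reproduced  by
  comparing  the  two  consensus  lines  character by character.  The
  function name mismatch_line is made up for this sketch, which assumes
  that  any  difference,  including  a  dash against a base, counts
  as a mismatch.

```python
def mismatch_line(upper, lower, mark="*"):
    """Return a line with a mark under every position where the two differ."""
    if len(upper) != len(lower):
        raise ValueError("the two sequences must be the same length")
    return "".join(mark if a != b else " " for a, b in zip(upper, lower))


left = "TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG"
right = "TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC"
print(left)
print(right)
print(mismatch_line(left, right))
```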


        It  is  essential  that  the  user  aligns  the  two   contigs
  throughout  the  whole  region of overlap before completing the join
  because it is only at this stage that the two contigs can be  edited
  independently.  Once the join is completed the alignment can only be
  altered using the routines supplied by  "alter  relationships".  The
  program  offers  the user options to facilitate the alignment of the
  two contigs.  These options are:-

     ? = Help
     ! = Give up
     3 = Complete join
     4 = Edit left contig
     5 = Display joint
     6 = Edit right contig
     7 = Move join

  1. Help gives this information.
  2. Give up allows the user to return to  the  main  options  without
  completing the join. Note any edits made will remain.
  3. Complete join instructs the program to update  the  relationships
  so  that  the two contigs are joined. DO NOT KILL THE PROGRAM DURING
  COMPLETE JOIN!
  4. Edit left contig and edit right contig give access  to  a  simple
  editor  that  allows insertions, deletions and changes to be made to
  the  contigs. Help is available on editing once the  editing  option
  is  selected.  The user is only allowed to edit within the region of
  overlap and should make sure that the positions used  correspond  to
  the correct contig.
  5. Display joint displays the joint as shown above.
  6. Edit right contig: see 4 above.
  7. Move join allows the position of the joint to be changed.
 @24. TX 1 @               Copy the database

        Used to make a copy of the database. If required the  database
  size  can  be altered using this option. The "version" of a database
  is  encoded as the last letter in the names of the three files  that
  contain the database.

        Supply a "version" number (the default is version 1),  and  if
  required  select a new size for the database. The size of a database
  is the number of lines of information it can hold. It needs  a  line
  for each gel reading and another for each contig.
 @19. TX 1 @               Check database

        Used to perform a check on  the  logical  consistency  of  the
  database. No user intervention is required.

        The following relationships are checked:
  1.       If gel reading A thinks gel reading B is its left neighbour
  does B think A  is its right neighbour?  The error message is
  "Hand holding problem for gel reading A"
  followed by  the gel descriptor lines for gel readings A and B.
  2.       Are there any contig lines with no left or  right  end  gel
  readings?  The error message is
  "Bad contig line number A"
  3.       Do the gel readings that are  described  as  left  ends  on
  contig lines agree that they are left ends?  The error message is
  "The end gel readings of contig A have outward neighbours"
  4.       Are there gel readings that are in more  than  one  contig?
  The error message is
  " Gel number A is used N times"
  5.       Are there gel readings that are not  in  any  contig?   The
  error message is
  " Gel number A is not used"
  6.       Do the relative positions of   gel  readings   agree   with
  their  position  as  defined by left and right neighbourliness?  The
  error message is
  " Gel number A with position X is left neighbour of  gel  number   B
  with position Y"
  7.       Are there any loops in   contigs?    If   so   no   further
  checking is done.  The error message is
  " Loop in contig n no further checking done, but gel reading numbers
  follow"
  The program  then  prints the gel  reading  numbers  in  the  looped
  contig up to the start of the loop.
  8. Are there any contigs of length <1? The error message is
  " The contig on line number x has zero length"
  9. Are there any gel readings (used in only one  contig)  that  have
  zero length? The error message is
  " Gel number N has zero length"
  Note that "auto assemble"  also uses this logical consistency  check
  and will only tolerate a "Gel number N is not used" error. Any other
  error will cause it to give up.
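
        The first, fourth and fifth checks above can be sketched as
  follows. This is only an illustration in Python, not the program's
  own code: the dictionary layout, the use of 0 for "no neighbour" and
  the function names are all assumptions made for the example.

```python
# Hypothetical in-memory model: each gel reading number maps to a
# (left_neighbour, right_neighbour) pair; 0 means "no neighbour".
def hand_holding_problems(readings):
    """Return the reading numbers A whose left neighbour B does not
    name A as its right neighbour (check 1 in the list above)."""
    problems = []
    for a, (left, _right) in sorted(readings.items()):
        if left == 0:
            continue  # a left-end reading has no left neighbour to ask
        b = readings.get(left)
        if b is None or b[1] != a:
            problems.append(a)
    return problems


def usage_counts(contigs, readings):
    """Count how often each reading is reached by chaining rightwards
    from the left end of every contig (checks 4 and 5). `contigs` is a
    list of left-end reading numbers."""
    counts = {n: 0 for n in readings}
    for left_end in contigs:
        seen = set()  # guards against loops in a damaged contig
        n = left_end
        while n != 0 and n in readings and n not in seen:
            seen.add(n)
            counts[n] += 1
            n = readings[n][1]
    return counts
```

  A count of 0 corresponds to "is not used" and a count above 1 to "is
  used N times".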
 @29. TX 1 @               Examine quality

        Analyses the quality of the data in a contig.  It  reports  on
  the  proportion  of the consensus that is "well determined" and will
  display a sequence of symbols  that  indicate  the  quality  of  the
  consensus at each position.

        Identify the contig to analyse, and the section  of  interest.
  The  current  consensus  calculation  cutoff  score  will be used to
  decide if each position is "well determined". In general the quality
  of  a  reading deteriorates along the length of the gel and so it is
  also possible to use a length cutoff for  the  quality  calculation.
  Only  the  data  from  the  first  section  of  each reading will be
  included in the quality calculation. The  length  is  altered  under
  "set parameters" and is initially set to the maximum reading length.
  A summary showing the percentage of the consensus  that  falls  into
  each category of quality is shown. Choose whether or not to have the
  quality codes for each position of the consensus displayed. They can
  be displayed as either graphics or text.

        The quality of the data depends on the number of times it  has
  been  sequenced  and the particular uncertainty codes  used  in each
  gel reading.  This function divides the data into  five  categories,
  assigning each a symbol or code:
  1.  Well determined on both strands and they agree.  code=0
  2.  Well determined on the plus strand only.  code=1
  3.  Well determined on the minus strand only.  code=2
  4.  Not well determined on either strand.  code=3
  5.  Well determined on both strands but they disagree.  code=4
  A position is "well determined" if it is assigned one of the symbols
  A,C,G,T  by  the  algorithm  described in the section "calculate a
  consensus".   The  calculation  is  performed  separately  for  each
  strand.
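
        The assignment of codes can be sketched as below. This is a
  minimal Python illustration, assuming that the consensus symbol for
  each strand at a position has already been calculated; the function
  names are hypothetical and not part of the program.

```python
def quality_code(plus, minus):
    """Map the per-strand consensus symbols at one position to the
    quality codes 0-4 listed above. A strand is treated as well
    determined when its symbol is one of A, C, G, T."""
    bases = {"A", "C", "G", "T"}
    plus_ok = plus in bases
    minus_ok = minus in bases
    if plus_ok and minus_ok:
        return 0 if plus == minus else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def quality_summary(codes):
    """Percentage of positions that fall into each of the five codes."""
    total = len(codes)
    return [100.0 * codes.count(c) / total if total else 0.0
            for c in range(5)]
```

  For example quality_code("A", "-") gives 1 (plus strand only) and
  quality_code("A", "C") gives 4 (both strands, but they disagree).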

        If the user chooses to have the data displayed graphically the
  following  scheme  is used. A rectangular box is drawn so that the x
  coordinate  represents  the  length  of  the  contig.  The  box   is
  notionally divided vertically into 5 possible levels which are given
  the y values: -2,-1,0,1,2.  The quality  codes  attributed  to  each
  base  position are plotted as rectangles.  Each rectangle represents
  a region in which the quality codes are identical, so a single  base
  having a different code from its immediate neighbours will appear as
  a very narrow rectangle.

    Rectangle bottom and top y values

       Quality 0 rectangle from 0 to 0
       Quality 1 rectangle from 0 to 1
       Quality 2 rectangle from 0 to -1
       Quality 3 rectangle from -1 to 1
       Quality 4 rectangle from -2 to 2

        Obviously a single line  at  the  midheight  shows  a  perfect
  sequence.
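
        The grouping of identical neighbouring codes into rectangles
  might be sketched as follows. The y values are copied from the table
  above; the function name and the tuple layout are assumptions made
  for this Python illustration.

```python
# Bottom and top y values for each quality code, as listed above.
RECT_Y = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes, start=1):
    """Group a sequence of quality codes into runs of identical codes
    and return one (x_first, x_last, y_from, y_to) tuple per run.
    `start` is the base number of the first code."""
    rects = []
    run_start = 0
    for i in range(1, len(codes) + 1):
        # Close the current run at the end of the data or when the
        # code changes.
        if i == len(codes) or codes[i] != codes[run_start]:
            y_from, y_to = RECT_Y[codes[run_start]]
            rects.append((start + run_start, start + i - 1, y_from, y_to))
            run_start = i
    return rects
```

  A single base with a code different from its neighbours produces a
  run of length one, which is the "very narrow rectangle" described
  above.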

        Typical dialogue is shown below.

     41.47% OK on both strands and they agree(0)
     55.48% OK on plus strand only(1)
      2.08% OK on minus strand only(2)
      0.97% Bad on both strands(3)
      0.00% OK on both strands but they disagree(4)
    ? (y/n) (y) Show sequence of codes

             10         20         30         40         50
     1111111111 1111111111 1111111111 1111111111 1111111111

             60         70         80         90        100
     1111111111 1111111111 1111111111 3111111111 1111111111

            110        120        130        140        150
     1111111111 1111131111 1111111111 1111111111 1111111111

            160        170        180        190        200
     1111111111 1111111111 1111111111 1111111111 1111111133

            210        220        230        240        250
     1311111111 1111111111 1111111110 0000000000 0000220000

            260        270        280        290        300
     0000000000 0020000000 2200000202 0002000000 0000222200

 @26. TX 3 @               Alter relationships

        Used  to  make  what  are  normally  illegal  changes  to  the
  database. That is, the normal checks are not done and any item in the
  database can be changed independently of all others. Users  need  to
  know  what they are doing because it is very easy to make a horrible
  mess. Always start by making a copy!

        By using the  options  here  users  can  edit  individual  gel
  readings  in  contigs,  move  one  section  of  a contig relative to
  another, break contigs, remove contigs, remove gel readings, etc. To
  give  flexibility most of the commands do only one thing. This means
  that several commands may  have  to  be  executed  to  complete  any
  change.  At the end of this help section there are notes on removing
  gel readings from the database.

        The following options are offered:

     ? = HELP
     ! = QUIT
     3 = Line change
     4 = Edit single gel reading
     5 = Delete contig
     6 = Shift
     7 = Move gel reading
     8 = Rename gel reading
     9 = Break a contig

  1. HELP gives this information.
  2. QUIT returns to the main options of SAP.
  3. Line change
  allows the user to change the contents  of  any line in the file  of
  relationships.   The  line is selected by number, the program prints
  the current line and prompts for the new  line.
  4.   Edit
  allows   the   user   to    edit    an    individual    gel  reading
  independently of any others it may be related to. The edit positions
  are relative to the contig. The effect of this editing on the length
  of the gel reading is taken care of but, if it changes the length of
  a contig, or its relationship to others, this must be accounted  for
  (if necessary) by use of the "line change" function.
  5.  Delete  contig
  is a function that deletes a contig line  by moving  down  all   the
  contig lines above by one position.  It prompts only for the line to
  delete.  It does not  delete  any   of   the  gel  readings  or  gel
  reading  lines  for the deleted contig but it does reduce the number
  of contigs on line IDBSIZ by 1.
  6.  Shift
  allows the user to change all the relative  positions of  a set   of
  neighbouring  gel  readings by some fixed value, i.e.  it will shift
  related gel readings either left or  right.   It  can  therefore  be
  used  to  change the alignment of the gel readings in a contig or as
  part of the process of breaking a contig into two parts (see below).
  It  prompts  for  the  number  of the first gel reading to shift and
  then  for the  distance  to  move  them (Note a negative value  will
  move  the  gel readings left and a positive value right).   It  then
  chains rightwards (ie follows right neighbours) and shifts each  gel
  reading,  in  turn,  up to the  end of the contig.  (This means that
  only those gel readings from the first to shift to the rightmost are
  moved). It updates the length of the contig accordingly.
  7. Move gel reading
  is  a  function  to  renumber  a  gel  reading.  It  moves  all  the
  information  about  a  gel reading on to another line. The user must
  specify the number of the gel  reading to move and the number of the
  line  to place it. It takes care of all the relationships. Of course
  gel readings must not be  moved  to  lines  occupied  by  other  gel
  readings!  It  can  be used as part of the process of removing a gel
  reading from the database (see below).
  8.  Rename gel reading
  is a function that is used to  rename  the archive   names   of  gel
  readings   in  the  database;   it only changes the name in the .ARN
  file of the  database.

  9. Break contig

        Occasionally it is necessary to break a contig into two  parts
  and  this  can be achieved using this option. The program needs only
  the number of a gel reading. This is  the  gel   reading  that  will
  become  a  left  end  after  the  break.  That is, the break is made
  between this gel reading and its left neighbour. A new  contig  line
  is created so ensure that there is sufficient space in the database.
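
        The chaining performed by the Shift option (6 above) can be
  sketched as below. This is a Python illustration only: the record
  layout ("pos", "length", "right") is hypothetical, the dictionary is
  assumed to hold the readings of one contig, and the contig is assumed
  to contain no loops.

```python
# Hypothetical record layout: reading number -> dict with "pos"
# (relative position in the contig), "length", and "right" (number of
# the right neighbour, 0 = none).
def shift_readings(readings, first, distance):
    """Add `distance` to the position of reading `first` and of every
    reading reached by following right neighbours. A negative distance
    moves readings left, a positive one right. Returns the new contig
    length, taken here as the rightmost base covered by any reading."""
    n = first
    while n != 0:
        readings[n]["pos"] += distance
        n = readings[n]["right"]
    return max(r["pos"] + r["length"] - 1 for r in readings.values())
```

  Readings to the left of the first reading shifted are not touched,
  as described under option 6.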
  Removing gel readings from contigs

        Gel readings can be removed  from  contigs  if  they  are  not
  essential  for  holding the contig together (ie are not the only gel
  reading covering a particular region). Suppose the  gel  reading  to
  remove  is gel number b with left neighbour a and right neighbour c.
  Using "line change" change the right neighbour of a to  c,  and  the
  left neighbour of c to a. To tidy things up: suppose there are x gel
  readings in the database; then, using "move gel reading" move gel  x
  to  line  b;  then,  using  "line change" decrease the number of gel
  readings in the database (stored in the last line) by 1.
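
        The sequence of steps above can be sketched as follows. This is
  a Python illustration under stated assumptions: the database is
  modelled as a dictionary of lines with "left" and "right" neighbour
  numbers (0 = none), the contig lines are not modelled, and the
  function name is hypothetical.

```python
def remove_reading(lines, n_readings, b):
    """Unlink reading b from its neighbours a and c, move the last
    reading x into line b, and return the decreased reading count.
    Contig lines (left and right end readings) are not updated here."""
    a, c = lines[b]["left"], lines[b]["right"]
    if a == 0 or c == 0:
        raise ValueError("sketch only handles a reading with both neighbours")
    lines[a]["right"] = c          # "line change" on a
    lines[c]["left"] = a           # "line change" on c
    x = n_readings
    if x != b:
        moved = lines[x]           # "move gel reading" x to line b
        lines[b] = moved
        if moved["left"]:
            lines[moved["left"]]["right"] = b
        if moved["right"]:
            lines[moved["right"]]["left"] = b
    del lines[x]                   # line x is now free
    return n_readings - 1
```

  In the program itself "move gel reading" takes care of all the
  relationships, so only the two "line change" steps and the final
  change of the reading count need to be done by hand.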
 @27. TX 1 @  Set display parameters

        Used to  redefine  the  parameters  that  control  the  cutoff
  employed  by  the  consensus  calculation  and quality examiner, the
  maximum  length  of  each  reading  to  include   in   the   quality
  calculation,  the line length used by the display function, the text
  window length used by the graphics options, and the graphics  window
  length used by the graphics options.

        The default cutoff score is 75%. The default line length is 50
  characters. For protein sequences the cutoff is always 100%.

        The text window used by  the  graphics  options  controls  the
  amount  of  sequence  listed at the crosshair position. The graphics
  window controls the "zoom" function. Both these windows are  defined
  as  the number of bases that should be shown, to both left and right
  of the crosshair.
 @30. TX 3 @  Auto edit a contig

        This function automatically changes characters in gel readings
  to  make  them  agree with the consensus sequence. If employed as is
  intended, use of this function is not  a  criminal  activity  but  a
  method  that saves a large amount of work. All characters changed by
  the auto editor  will  appear  in  the  gel  readings  as  lowercase
  letters. The current consensus calculation cutoff score is used.

        Identify the contig and the section to edit. The program  will
  display  a  summary  of  changes  made. Note that it is important to
  understand both what the auto editor does and the order in which  it
  does  it. Before employing the auto editor users should note all the
  corrections that they require, so that  after it has been  used  the
  corrections can be checked.

        The general strategy employed when collecting shotgun sequence
  data  is  to let the contigs get fairly deep, to get a printout of a
  contig, check problems against the films, note  corrections  on  the
  printout,  and  make  the  changes  using  an interactive editor. In
  general the consensus is correct except  for  places  where  padding
  characters  have been used to accommodate a single gel with an extra
  character, or where the consensus is dash. The important  point  for
  the  auto  editor  is  that  most edits simply make the gel readings
  conform to the consensus, or remove columns of pads.

        The new editor does the following.

        1) calculates a consensus for the contig (or part of a contig)
  to  be edited, and then uses this consensus to direct the editing of
  the contig in 3 stages

        2) stage 1: find and correct all places where, if the order of
  two  adjacent  characters  is swapped, they will both agree with the
  consensus (given that they did  not  match  the  consensus  before).
  These corrections are termed "transpositions"

        3)  stage 2: find and correct all  places  where  there  is  a
  definite  consensus  but  the gel reading has a different character.
  These corrections are termed "changes".

        4) stage 3: delete all  positions  in  which  padding  is  the
  consensus. These corrections are termed "deletions".

        All changed characters are shown in lowercase  letters  so  it
  will  be  obvious which characters have been assigned by the program
  (except for deletions). The number of each type of  correction  will
  be displayed.
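
        The three stages can be illustrated on a single gel reading as
  below. This Python sketch is not the program's code. It assumes that
  the reading and the consensus are strings of equal length covering
  the same region, that "*" is the padding character, and that A,C,G,T
  are the only definite consensus symbols; the function name is
  hypothetical.

```python
BASES = "ACGT"
PAD = "*"  # assumed padding character, as in the displays in this help


def auto_edit(reading, consensus):
    """Edit one aligned reading against a consensus of the same length.
    Returns (edited_reading, transpositions, changes, deletions)."""
    r = list(reading)
    c = consensus.upper()
    transpositions = changes = 0
    # Stage 1: swap adjacent characters when neither matches the
    # consensus but both would match after the swap.
    i = 0
    while i < len(r) - 1:
        x, y = r[i].upper(), r[i + 1].upper()
        if x != c[i] and y != c[i + 1] and x == c[i + 1] and y == c[i]:
            r[i], r[i + 1] = y.lower(), x.lower()
            transpositions += 1
            i += 2
        else:
            i += 1
    # Stage 2: where the consensus is a definite base, make the
    # reading agree with it.
    for i, ch in enumerate(r):
        if c[i] in BASES and ch.upper() != c[i]:
            r[i] = c[i].lower()
            changes += 1
    # Stage 3: delete positions whose consensus is the padding character.
    kept = [ch for i, ch in enumerate(r) if c[i] != PAD]
    deletions = len(r) - len(kept)
    return "".join(kept), transpositions, changes, deletions
```

  With reading "ACTGA*T" and consensus "ACGTA*T" the sketch returns
  "ACgtAT" with one transposition, no changes and one deletion.
  Positions where the consensus is dash are left alone.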
 @10. TX 2 @Clear graphics

        Clears graphics from the screen.
 @11. TX 2 @Clear text

        Clears  text from the screen.
 @12. TX 2 @Draw a ruler.

        This option allows the user to draw a ruler or scale along the
  x  axis  of the screen to help identify the coordinates of points of
  interest. The user can define the position of the first base  to  be
  marked  (for  example if the active region is 1501 to 8000, the user
  might wish to mark every 1000th base starting at either 1501 or 2000
  -  it  depends  if  the user wishes to treat the active region as an
  independent unit with its own numbering starting at its  left  edge,
  or  as  part  of  the  whole sequence). The user can also define the
  separation of the ticks on the scale and their height.  If  required
  the labelling routine can be used to add numbers to the ticks.
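
        The choice of tick positions can be sketched as below, using
  the example of an active region from 1501 to 8000. This is a Python
  illustration with a hypothetical function name, not the program's
  own routine.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Base positions at which ticks are drawn, from first_tick to the
    end of the active region in steps of separation."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [p for p in range(first_tick, region_end + 1, separation)
            if p >= region_start]
```

  ruler_ticks(1501, 8000, 2000, 1000) marks 2000, 3000 and so on up to
  8000 (numbering as part of the whole sequence), whereas
  ruler_ticks(1501, 8000, 1501, 1000) marks 1501, 2501 and so on up to
  7501 (numbering from the left edge of the active region).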
 @14. TX 2 @Reposition plots

        The position of each of the plots is defined  relative  to  a
  user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
  Plots for each option are drawn in a window  defined  by  x0,y0  and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the window, xlength is the width  of  the  window  and
  ylength the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "ANALMARG" when
  the program is started. Users can have their own file  if  required.
  As  all  the plots start at the same position in x and have the same
  width, x0 and xlength are the same for all options. Generally  users
  will  only  want  to change the start level of the window y0 and its
  height ylength. This option allows users to change window  positions
  whilst  running  the  program.   The  routine  prompts first for the
  number of the option that the user wishes to reposition;  then  for
  the  y  start and height; then for the x start and length. Note that
  changes to the x values affect all options. If the user  types  only
  carriage  return  for any value it will remain unchanged. Note that,
  unlike all the other programs, the boxes used to contain  analytical
  results (eg plot quality) should not be made to overlap one another,
  as the function of the crosshair routine depends on  which  box  the
  crosshair is in!
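
        A proposed set of window positions could be checked as
  sketched below before being typed in. This is a Python illustration
  only; the function names are hypothetical, and windows that merely
  touch at an edge are treated here as not overlapping.

```python
BOARD_MIN, BOARD_MAX = 1, 10000  # drawing board units in x and in y


def window_fits(x0, y0, xlength, ylength):
    """True if the window lies wholly on the drawing board."""
    return (BOARD_MIN <= x0 and BOARD_MIN <= y0
            and xlength > 0 and ylength > 0
            and x0 + xlength <= BOARD_MAX
            and y0 + ylength <= BOARD_MAX)


def windows_overlap(y0_a, ylength_a, y0_b, ylength_b):
    """True if two windows share any y range. Only y is compared
    because x0 and xlength are the same for all options."""
    return y0_a < y0_b + ylength_b and y0_b < y0_a + ylength_a
```

  For example windows starting at y0=500 and y0=3000, each of height
  2000, do not overlap, but moving the second to y0=2000 makes them
  overlap.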
 @15. TX 2 @Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written on the  diagram
  with  its  left end at the cross hair position.  The cross-hair will
  then immediately reappear. The user may put the same label on
  another part of the diagram in the same way, or hit the space bar to
  be asked whether to type in another label.

        Typical dialogue follows.
  ? Menu or option number=15
  Type label then drive cross hair to left or right end
  of label position then hit  "L"  to  write label left
  justified or  "R"  to  write label right justified or
  the space bar to quit


  ? Label=delta gene

   missing graphics

  ? Label=

 @16. TX 2 @Display a map.

        This draws a map of any  sequence  features  selected  by  the
  user.   These  features  may  be  protein coding regions (CDS), tRNA
  genes (TRNA), promoter positions (PRM), etc. Users may define  their
  own  feature  table  key  names. For example I find it convenient to
  split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
  those  sequences  that  code in the reading frames 1, 2 or 3. Then I
  can plot them at different heights on the screen ( suitable  heights
  can be determined by using the cross-hair).  The coordinates must be
  stored in a file in the format of an EMBL feature table.

        Typical dialogue follows.
  ? Menu or option number=16
   Display a map using an EMBL feature table file
  ? map file name=hsegl1.ft
  ? feature code(e.g. CDS) =CDS
  X 1 + strand
    2 - strand
    3 both strands
  ? 0,1,2,3 =
  ? level (0-9480) (256) =4000

   missing graphics

  ? feature code(e.g. CDS) =

 @7. TX 1 @Redirect output

        Used to direct output that would normally appear on the screen
  to a file.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @13. TX 2 @Use crosshair
  This option puts a steerable cross on  the  screen  which  the  user
  drives around by using the arrow keys (or mouse). When the crosshair
  is visible a number of options are available if the user  types  one
  of  a  set of special keyboard characters. Any other characters will
  cause an exit from the crosshair option. The special keys are:

      I = Identify the nearest gel reading
      Z = Zoom in
      Q = plot Quality
      S = display the aligned Sequences at the crosshair position
      N = list the Names and Numbers of the sequences at the crosshair

        In order for  any  of  these  special  keys  to  operate,  the
  crosshair  must  be  in  an appropriate display box, and the precise
  function of the keys will also depend on which box the crosshair  is
  in.

        If the crosshair is in the "plot  all  contigs"  box,  Z  will
  cause  a  new box to appear showing all the readings for the nearest
  contig; Q will give the same as Z but will also produce an extra box
  showing the "quality" plot.

        If Z is hit in the "plot single contig" box, the  contig  will
  be  zoomed  to  the  current  graphics window size. The zoom will be
  roughly centred on the crosshair position. Because  of  this  it  is
  possible  to  step  along  a  contig  by repeatedly zooming with the
  crosshair near to one end of the single contig display box. If I  is
  hit  the crosshair must be close to a gel reading line. If Q is hit,
  the quality plot will be produced for the region shown in  the  plot
  single  contig  box. In all cases when the "plot all contigs" box is
  shown, a vertical line will bisect  the  line  that  represents  the
  relevant contig, at the current position.

        If the crosshair is in the plot quality box only the character
  "s" will operate as a special symbol.

        The number of bases shown in the N and S options is controlled
  by  the  current graphics text window size, and the size of the zoom
  window by the current graphics window size.  Both  are  set  by  the
  parameter setting function of the general menu.
 @33. TX 2 @Plot single contig
  This option produces a schematic of a selected region  of  a  single
  contig  by  drawing  a  horizontal line to represent each of its gel
  readings. The lines show the relative positions of each reading  and
  also  their  sense. The plot is divided vertically into two sections
  by a line that is identified by an asterisk drawn at each  end.  All
  lines  that lie above this line represent readings that are in their
  original sense, all lines  below  show  readings  that  are  in  the
  complementary  sense  to  their  original.  By  use of the crosshair
  function the plot can  be  stepped  through  and  examined  in  more
  detail. See help on crosshair.
 @34. TX 2 @Plot all contigs
  This option produces a schematic of all the contigs in  a  database.
  It does this by drawing a horizontal line to represent each of them.
  In order to show the ends of each contig  it  draws  the  lines  for
  contigs at alternate heights: the first at height one, the second at
  height two, the third at height one, etc. The order of  the  contigs
  in the display is the same as their order in the database. By use of
  the crosshair function the plot can be stepped through and  examined
  in more detail. See help on crosshair.
 @31. TX 3 @ Type in gel readings
  This option allows gel readings to be typed in at the  keyboard.  It
  creates  a  separate  file  for  each gel reading and a file of file
  names for the batch. The sequences from each  batch  may  be  listed
  when  they have all been entered. Users may choose to employ special
  keys to identify the 4 bases A,C,G and T. By default  these  special
  keys  are  N  M  ,  .  but any other four characters may be used. If
  special keys are used the characters are automatically translated to
  A C G T before being stored on the disk.
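
  The translation of special keys can be sketched as below. This is a
  Python illustration, not the program's code. It assumes that the four
  keys are paired with A, C, G and T in the order given, and that upper
  and lower case keys are treated alike; the function name is
  hypothetical.

```python
def make_key_translator(keys="NM,."):
    """Build a translation table from four special keys to A, C, G, T.
    Pairing keys with bases by position is an assumption."""
    if len(keys) != 4 or len(set(keys.upper())) != 4:
        raise ValueError("exactly four distinct keys are needed")
    return str.maketrans(keys.upper() + keys.lower(), "ACGT" * 2)
```

  With the default keys, "nm,.".translate(make_key_translator()) gives
  "ACGT". Characters that are not special keys pass through unchanged.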
 @35. TX 1 @ Find internal joins
  The purpose of this function is to use data already in the  database
  to  find possible joins between contigs.  Joins may have been missed
  due to poor  data  or  may  have  not  been  made  due  to  repeated
  sequences.  Where  appropriate, it may be possible to find potential
  joins by using the data clipped off readings prior  to  their  entry
  into the database.
  The database is checked for logical consistency.  Supply  a  minimum
  initial  match  length,  a minimum alignment block, the maximum pads
  per sequence, the maximum  percent  mismatch  after  alignment,  the
  probe length. Choose if clipped data is to be used, if so define the
  window size for finding good data and the number of  dashes  allowed
  in  the  window. Processing will commence.  Most of these values are
  used in an identical way in the autoassemble  function.  The  others
  are defined below.
  The program strategy
  Take the first contig and calculate its consensus. If  clipped  data
  is  being  used  examine  all readings that are in the complementary
  orientation, and sufficiently near to the contig's left end, to  see
  if they have good clipped sequence which, if present, would protrude
  from the left end of the contig.  If  found  add  the  longest  such
  sequence to the left end of the consensus. Do the same for the right
  end by examining readings that are in their original orientation. If
  any  are  found  add  the  longest extension to the right end of the
  consensus. Repeat the consensus calculations and extensions for  all
  contigs  hence  producing  an extended consensus. If clipped data is
  not  being  used  simply  calculate  the  consensus  for  the  whole
  database.  Now  look  for  possible joins by processing the extended
  consensus in the following  way.  Take  the  last,  say  100,  bases
  (termed  the  "probe  length"  by  the  program)  of  the  rightmost
  consensus, compare it in both orientations with the extended consensus
  of  all the other contigs. Display any sufficiently good alignments.
  Repeat with the left end of the rightmost contig. Do  the  same  for
  the ends of all the extended contigs, always only comparing with the
  contigs to their left, so that the same matches do not appear twice.
  Good clipped data is defined by sliding a window of "Window size for
  good  data scan" bases outwards along the sequence and stopping when
  "Maximum number of dashes in scan window" or more dashes  appear  in
  the  window.   Note that it is advisable to have some sort of cutoff
  because if we simply take all the  data  it  might  be  so  full  of
  rubbish  that  we won't find any good matches. For the same reason it
  is worth trying the procedure with different cutoffs. An initial run
  using  no  clipped  data  is  also  recommended.   Sufficiently good
  alignments are defined by  criteria  equivalent  to  those  used  in
  autoassemble,  however here we only display alignments that pass all
  tests.
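  The window scan and the choice of probe can be sketched as below.
  This is a Python illustration under stated assumptions: the clipped
  data is given as a string ordered outwards from the contig end, and
  the data kept is everything before the first window that fails the
  test. The text above does not say exactly where the cut is made, so
  that choice, like the function names, is an assumption.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Number of characters of clipped data to keep. The scan stops at
    the first window holding max_dashes or more dashes; only the data
    before the start of that window is kept (an assumption)."""
    last_start = max(len(clipped) - window_size, 0)
    for start in range(last_start + 1):
        if clipped[start:start + window_size].count("-") >= max_dashes:
            return start
    return len(clipped)


def end_probe(consensus, probe_length=100):
    """The last probe_length bases of a consensus, or the whole of it
    if it is shorter."""
    return consensus[-probe_length:] if probe_length > 0 else ""
```

  For example, with a window of 5 and a maximum of 2 dashes, the
  clipped data "ACGTACGT--A-CG" is cut after its first 5 characters,
  because the window "CGT--" starting there is the first to hold 2
  dashes.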
  Bugs
  If a small contig is wholly contained within a larger one, such that
  its  ends  are further than ("Probe length" - "Minimum initial match
  length") from the ends of the larger contig, and the  consensus  for
  the small contig lies to the left of the consensus for large contig,
  the overlap will not be discovered. (See the search strategy).
  All numbering is relative to base number one in the contig:  matches
  to  the  left  (i.e.  in  the clipped data) have negative positions,
  matches off the right end of the contig (i.e. in the  clipped  data)
  have  positions  greater  than  that of the contig length. A typical
  result is shown below.

   Right end of contig   22 in the - sense  and contig   96
   Percentage mismatch after alignment =  3.0
          628        638        648        658        668        678
            GTGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
             ********* ********** ********** ********** ********** **********
            -TGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
          -86        -76        -66        -56        -46        -36
          688        698        708        718
            ACATCCGGGA TACAGTCAAT AAATGAAAAA TTAATGAATT
            ********** ********** ****** *** ***** ****
            ACATCCGGGA TACAGTCAAT AAATGA-AAA TTAATTAATT
          -26        -16         -6          4