staden-lg/help/bap_help

 @-1. TX  0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. TX  0 @Modification

 @0.  TX -1 @BAP

        This is an  interactive  program  whose  primary  use  is  for
  managing  shotgun  sequencing  projects, but it can also be used for
  handling alignments of other sequences, including those of proteins.
  Currently   the   maximum  'gel  reading'  length  is  set  to  4096
  characters. Almost all of the information below describes the use of
  the  program  for shotgun projects, but those using the programs for
  handling other sequence alignments should interpret it  accordingly.
  The data for such a project is stored in a special type of database.
  The program contains the tools  that  are  required  to  screen  gel
  readings  against  vector  sequences  and  restriction sites, and to
  assemble new gel readings into the database (automatically comparing
  and aligning them). In addition it contains editors and functions to
  examine the quality of the aligned sequences.

        There  are  three  main   menus:   "general",   "screen"   and
  "modification", and some functions have submenus.
    The general menu contains the following options:

         Open a database
         Display a contig
         List a text file
         Direct output
         Calculate a consensus
         Screen against restriction enzymes
         Screen against vector
         Check logical consistency
         Copy database
         Show relationships
         Set parameters
         Highlight disagreements
         Examine quality
         Check Assembly
         Find read pairs

  The graphics menu contains:

         Clear graphics
         Clear text
         Draw ruler
         Use cross hair
         Change margins
         Label diagram
         Plot map
         Plot single contig
         Plot all contigs


  The modification menu contains:

         Edit contig
         Auto assemble
         Join contigs
         Complement a contig
         Alter relationships
         Extract gel readings
         Find internal joins
         Disassemble readings
         Shuffle pads
         Auto-select oligos
         Double strand

  The alter relationships menu contains:

         Cancel
         Line change
         Check logical consistency
         Remove contig
         Shift
         Move gel reading
         Rename gel reading
         Break a contig
         Remove a gel reading
         Alter raw data parameters


        Overview of the methodology

        The shotgun sequencing strategy

        In  the  shotgun  sequencing  procedure  the  sequence  to  be
  determined   is   randomly  broken  into  fragments  of  about  1000
  nucleotides in length. These fragments are cloned and then  selected
  randomly  and  their  sequences    determined.     The  relationship
  between  any  pair  of fragments is  not  known  beforehand  but  is
  found  by  comparing  their   sequences.  If  the  sequence  of  one
  is found to be wholly or partially contained within that of  another
  for  sufficient  length  to  distinguish  an overlap  from  a repeat
  then those two fragments can  be  joined.  The  process  of  select,
  sequence  and  compare is continued until the whole of  the  DNA  to
  be  sequenced is in one continuous well determined piece.

        Definition of a contig

        A CONTIG is a set of gel  readings   that   are   related   to
  one  another   by   overlap  of  their  sequences.  All gel readings
  belong to a contig and each contig  contains  at   least   one   gel
  reading.   The  gel  readings in a contig can be summed to produce a
  continuous consensus sequence and the length of this sequence is the
  length  of the contig.  The rules used to perform this summation are
  given  under  "the  consensus  algorithm".   At  any  stage  of    a
  sequencing  project the data will comprise a number of contigs; when
  a  project  is complete  there  should be only one  contig  and  its
  consensus  will  be  the  finished  sequence.  Note that since being
  introduced and defined as above the word "contig" has been taken  up
  by  those involved in genomic mapping. In that context the consensus
  with a  precise length is, of course, not defined.

  Introduction to the computer method

        It is useful  to  consider  the  objectives  of  a  sequencing
  project  before  outlining  how  we use the computer to help achieve
  them. The aim of a shotgun  sequencing  project  is  to  produce  an
  accurate  consensus sequence from many overlapping gel readings.  It
  is necessary to know, particularly  at  the  latter  stages  of  the
  project,  how accurate the consensus sequence is. This enables us to
  know which regions of the sequence require further work and also  to
  know  when  the  project  is  finished.   To show the quality of the
  consensus, the programs described here produce  displays  like  that
  shown below.


                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

        This is an example showing the left  end  of  a  contig   from
  position   1  to  200.   Overlapping  this  region  are gel readings
  numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in  reverse  orientation
  to  their  original  reading  (denoted  by  a  minus sign). Each gel
  reading also has a name (eg HINW.010). It can  be  seen  that  in  a
  number  of  places the sequences contain characters other than A,C,G
  and T. Some  of  these  extra  characters  have  been  used  by  the
  sequencer   to  indicate  regions  of  uncertainty  in  the  initial
  interpretation of the gel reading, but the asterisks (*)  have  been
  inserted  by  the  automatic assembly function in order to align the
  sequences.  Underneath  each  50  character  block  of  gel  reading
  sequences  is the consensus derived from the sequences aligned above
  (the line labelled CONSENSUS). For most of its length the  consensus
  has a definite nucleotide assignment but in a few positions there is
  insufficient agreement between the gel readings and so  a  dash  (-)
  appears  in  the  sequence.  This  display contains all the evidence
  needed to assess the quality of the consensus: the number  of  times
  the  sequence has been determined on each strand of the DNA, and the
  individual nucleotide assignments given for each gel reading.

        So the aim is to produce the consensus sequence  and,  equally
  important,  a  display of the experimental results from which it was
  derived.

        In order to achieve this the following operations need  to  be
  performed:
  1) Put individual  gel  readings  into  the  computer.   This  might
  involve  the  manual  interpretation  of  autoradiographs   or   the
  transfer and processing of machine-readable files  from  fluorescent
  sequencing machines.
  2) Check each gel reading to make sure it is not simply part of  one
  of the vectors used to clone the sequence.
  3) Check each gel reading to make sure  that  those  fragments  that
  span  the  ligation point used prior to sonication are not assembled
  as single sequences.
  4) Compare all the  remaining  gel  readings  with  one  another  to
  assemble them to produce the consensus sequence.
  5) Check the quality of the consensus and edit the sequences.
  6) When all the consensus is sufficiently well determined, produce a
  copy of it for processing by other analysis programs.

        It is very unlikely that this procedure will  only  be  passed
  through  once.   Usually steps 1 to 5 are cycled through repeatedly,
  with step 4 just adding new sequences to  those  already  assembled.
  Generally step 6 is also used in order to analyse imperfect sequence
  to check if it is the one the project intended to  sequence,  or  to
  look  for  interesting  features. Analysis of the consensus, such as
  searches for protein coding regions, can also help to find errors in
  the  sequence.  The  display  of  the overlapping gel readings shown
  above can be used  to  indicate,  not  only  the  poorly  determined
  regions,  but  also  which  clones  should be resequenced to resolve
  ambiguities, or those which can usefully be extended or sequenced in
  the reverse direction, to cover difficult regions.

        The original individual gel readings for a sequencing  project
  are  each  stored in separate files. As the gel readings are entered
  into the computer (usually in batches, say 10 from a film), the file
  names  they are given are stored in a further file, called a file of
  file names. Files of file names enable gel readings to be  processed
  in batches.
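
        As a minimal sketch of how a file of file names might be read,
  assuming  one  gel  reading file name per line (the exact layout is
  not specified here, and read_file_of_file_names is  a  hypothetical
  helper, not part of the program):

```python
def read_file_of_file_names(text):
    """Return the gel reading file names listed in `text`.

    Assumption: one name per line; blank lines are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


batch = read_file_of_file_names("HINW.004\nHINW.007\n\nHINW.009\n")
print(batch)
```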

        For each sequencing project we start a project database.  This
  database  has  a  structure  specifically  designed for dealing with
  shotgun sequence data. In order to arrive  at  the  final  consensus
  sequence  many  operations  will  be performed on the sequence data.
  Individual fragments must be sequenced and compared in  both  senses
  (i.e.  both  orientations)  with  all  the  other sequences. When an
  overlap between a new gel reading and a contig is found  they  must
  be aligned and the new gel reading added to the contig. If a new gel
  reading overlaps two contigs they must be aligned and joined. Before
  the  two contigs are joined one of them may need to be turned around
  (reversed  and  complemented)  so  they  are  both   in   the   same
  orientation.
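
        The "turning around" of a contig can be  illustrated  with  a
  short  sketch.  It  assumes  plain  A, C, G, T characters and passes
  anything else (padding characters, dashes, uncertainty codes) through
  unchanged;  how  the program itself treats those symbols is not stated
  here, and reverse_complement is a hypothetical name.

```python
# Translation table for the four unambiguous bases (both cases).
# Any other character is left as it is: an assumption of this sketch.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GGCACA*GTC"))
```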

        Clearly, keeping track of all  these  manipulations  is  quite
  complicated,  and  to  be  able  to  perform  the operations quickly
  requires careful choice of data structure and algorithms. For  these
  reasons  it  is not practicable to store the gel readings aligned as
  shown in the display above. Rather, it is more convenient  to  store
  the  sequences unassembled, and to record sufficient information for
  programs to assemble  them  during  processing.  The  data  used  to
  assemble the sequences is called relational information.

        The database comprises five files and they are described under
  the section entitled "open database".

        Before entry into the project database each  new  gel  reading
  must  be  compared  to  look  for overlaps with all the data already
  contained within the database. This last  point  is  important:  all
  searching  for  overlaps  is between individual new gel readings and
  the data already in the database. There is no searching for overlaps
  between sequences within the database; overlaps must be found before
  new gel readings are entered into the database.

        Below  I  give  an  introduction  to  how  the  sequences  are
  processed by being passed from one function to the next.

        This program is used to start a database for the  project  and
  then the following procedure is used.

        Data in the form of individual gel readings are  entered  into
  the computer and stored in  separate  files  (possibly   using   the
  digitizer program GIP). Batches of these gel readings are passed  to
  the  screening functions in this program to search for overlaps with
  vector sequences (see  VEP  and  "screen  against  vector")  or  for
  matches  to  restriction  enzyme  sites   that should not be present
  ("screen against enzymes"). Each run of  these  screening  functions
  passes  on  only  those  gel  readings  that do not contain unwanted
  sequences.  Sequences  are  passed  via  files  of  file  names  and
  eventually  are  processed by the automatic assembly function ("auto
  assemble"). This function compares each gel reading with a consensus
  of  all  the  previous  gel  readings stored in the database.  If it
  finds any overlaps it aligns the overlapping sequences by  inserting
  padding  characters,  and  then  adds  the  new  gel  reading to the
  database. Gels that overlap are added to existing contigs  and  gels
  that do not overlap any data in the database start new contigs. If a
  new gel overlaps two contigs they are joined. Any gel readings  that
  appear  to overlap but which cannot be aligned sufficiently well are
  not entered and have their names written to a  file  of  failed  gel
  reading names.

        Generally data is entered into the database in batches as just
  described.  The  program  is  also  used  to examine the data in the
  database, to enter gel readings that the automatic assembly function
  cannot  align  ("auto  assemble"), and to make final edits. Edits to
  whole contigs can   be  made  using  a  mouse-driven  editor  ("edit
  contig").

        Editing the  sequences  is  obviously  an  essential  part  of
  managing   a  sequencing  project.  Editing  is  required  when  new
  sequences are added, when contigs are joined, and when sequences are
  corrected.   A  basic part of the strategy used here is that new gel
  readings should be correctly aligned throughout their  whole  length
  when  they  are entered into the database, and that when contigs are
  joined they are edited so that they are well aligned in  the  region
  of  overlap.  Alignment can be achieved by adding padding characters
  to the sequences, and this is the way "auto assemble" operates  when
  adding new sequences to the database.

        In order to search for overlaps that may have been  missed  or
  may  be  hidden  in  the  "unused  data" the function "find internal
  joins" can be used.

        Generally the users need not concern themselves with  how  the
  relational  information  is used by the program, but it is necessary
  to know how contigs are identified. Because contigs  are  constantly
  being  changed  and  reordered  the  program  identifies them by the
  numbers of the gel readings they contain.  Whenever  users  need  to
  identify  a  contig they need only know the number or name of one of
  the gel readings it contains. Whenever the  program  asks  users  to
  identify  a  contig  or  gel reading they can type its number or its
  archive name. If they type its archive name they  must  precede  the
  name by a slash "/" symbol to denote that it is a name rather than a
  number. E.g. if the archive name is fred.gel with number  99,  users
  should  type  /fred.gel  or  99  when  asked to identify the contig.
  Generally, when it asks for the gel reading to  be  identified,  the
  program  will  offer  the user a default name, and if the user types
  only return, that contig will be accessed. When a database is opened
  the  default  contig  will  be  the  longest  one, but if another is
  accessed, it will subsequently  become the current default.
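
        The identification rule can be sketched as follows.  This  is
  only  an  illustration  of  the  convention  described above (a leading
  slash marks an archive name, a bare entry is a number, and  an  empty
  reply  selects  the default); parse_identifier and its return values
  are hypothetical.

```python
def parse_identifier(reply):
    """Classify a user's reply to a 'which contig or gel reading' prompt.

    Returns ("default", None), ("name", archive_name) or ("number", n).
    """
    reply = reply.strip()
    if reply == "":
        return ("default", None)      # user typed only return
    if reply.startswith("/"):
        return ("name", reply[1:])    # slash marks an archive name
    return ("number", int(reply))     # otherwise a gel reading number


print(parse_identifier("/fred.gel"))
print(parse_identifier("99"))
```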

        Further information is located in the  following  places.  The
  database  files  are described under "open database". The format for
  vector  and  consensus  sequences  is  given  under   "calculate   a
  consensus", as are the uncertainty codes used in gel readings.

        The digitizer program is used for the  initial  input  of  gel
  readings  and  for  writing a file of file names. The program uses a
  digitizer for data entry. A digitizer is a two  dimensional  surface
  such  as  a  light  box:  when a special pen is pressed onto it, the
  pen's coordinates are recorded by a computer. These coordinates  can
  be interpreted by a program.

        In order to read an autoradiograph placed on the light box the
  user  need  only  define the bottom of the four sequencing lanes and
  the bases to which they correspond and then use  the  pen  to  point
  to   each  successive   band  progressing  up  the gel.  The program
  examines the coordinates of each pen position to see in which of the
  four  lanes  it   lies  and  assigns  the  corresponding  base to be
  stored in the computer.  Each time the pen tip is depressed to point
  to  a  position on  the  surface of the digitizer the program sounds
  the bell on the terminal to indicate to the user that  a  point  has
  been recorded.  As the  sequence  is read the program displays it on
  the screen.
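
        A minimal sketch of the lane lookup follows. It  assumes  each
  lane  is  a  simple  range of x coordinates; the real geometry used by
  the digitizer program is not described here, and all names  and  lane
  boundaries are hypothetical.

```python
def base_for_pen_position(x, lanes):
    """Return the base whose lane contains x, or None if x is outside.

    lanes is a list of (left_x, right_x, base) tuples, one per lane.
    """
    for left, right, base in lanes:
        if left <= x < right:
            return base
    return None


# Four lanes defined by the user at the bottom of the gel (made-up values).
lanes = [(0.0, 10.0, "A"), (10.0, 20.0, "C"),
         (20.0, 30.0, "G"), (30.0, 40.0, "T")]

# Successive pen positions, progressing up the gel.
pen_x = [5.0, 25.0, 12.5, 38.0]
reading = "".join(base_for_pen_position(x, lanes) or "-" for x in pen_x)
print(reading)
```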
 @17. TX 1 @Screen against enzymes

        Used to compare gel readings against  any  restriction  enzyme
  recognition  sequences  that  may have been used  during cloning and
  which should not be  present  in  the  data.  Works  on  single  gel
  readings  or processes batches accessed through files of file names.
  The algorithm looks  for  exact  matches  to  recognition  sequences
  stored in a file.

        The  file  containing  the  recognition  sequences   must   be
  identified.  The  user  must choose between employing a file of file
  names, or typing in the names of individual gel reading files. If  a
  file  of  file names is used the program will also create a new file
  of file names. When the option has finished operating this new  file
  will  contain the names of all those gel readings that did not match
  any of the recognition sequences. Hence it can be used  for  further
  processing  of the batch. The recognition sequences should be stored
  in a simple text file with one recognition sequence per line.
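
        The screening step can be sketched as below. The sketch  holds
  the  readings  in  memory  and checks only the strand as given; whether
  the program also checks the complementary strand is not stated  here.
  The function names are hypothetical.

```python
def load_recognition_sequences(lines):
    """One recognition sequence per line; blank lines are skipped."""
    return [line.strip().upper() for line in lines if line.strip()]


def screen_batch(readings, sites):
    """Return the names of readings with no exact match to any site.

    readings maps a gel reading name to its sequence. The returned
    list plays the role of the new file of file names.
    """
    passed = []
    for name, seq in readings.items():
        seq = seq.upper()
        if not any(site in seq for site in sites):
            passed.append(name)
    return passed


sites = load_recognition_sequences(["GAATTC\n", "GGATCC\n"])
readings = {"a.gel": "ACGTGAATTCAA", "b.gel": "ACGTACGT"}
print(screen_batch(readings, sites))
```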
 @18. TX 1 @Screen against vector

        Used to compare gel readings against any vector sequences that
  may  have  been  picked  up  during  cloning and which have not been
  removed by VEP. It works on single gel readings or processes batches
  accessed  through files of file names. The algorithm looks for exact
  matches  of  length  "minimum  match  length"   and   displays   the
  overlapping sequences.

        The file containing the vector sequence  must  be  identified.
  The  user  must  choose  between  employing a file of file names, or
  typing in the names of individual gel reading files. If  a  file  of
  file  names  is used the program will also create a new file of file
  names. When the option has finished operating  this  new  file  will
  contain  the  names of all those gel readings that did not match the
  vector sequence. Hence it can be used for further processing of  the
  batch.  The  vector  sequence should be stored in a simple text file
  with up to 80 characters of data per line. More than one vector  can
  be  stored  in  a single file. If so each should be preceded by a 20
  character title of the form <---m13mp8.0001----> where the <  and  >
  signs  and  the number like .0001 are obligatory. The number must be
  preceded by a dot (.) and be 4 digits long. The  total  sequence  in
  the file must be < 500,001 characters in length.
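
        The title rule can be checked with a short  sketch.  It  tests
  only  what  is  stated  above  (20 characters, enclosing < and > signs,
  and a dot followed by exactly 4 digits);  is_vector_title  is  a
  hypothetical helper and the program may apply further checks.

```python
import re

# "<", any filler, a dot, exactly four digits, any filler, ">".
_TITLE = re.compile(r"^<[^<>]*\.\d{4}(?!\d)[^<>]*>$")


def is_vector_title(line):
    """True if line looks like a 20 character vector title."""
    line = line.rstrip("\n")
    return len(line) == 20 and _TITLE.match(line) is not None


print(is_vector_title("<---m13mp8.0001---->"))
```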
 @20. TX 3 @Auto assemble

        Compares gel readings against  the  current  contents  of  the
  database  and  produces  alignments. In its normal mode of operation
  ("entry permitted"), the function will automatically enter  the  gel
  readings into the database.

        New assembly suboption.  However if entry is not permitted the
  reads  won't  be entered but the program will produce alignments and
  (optionally) save each reading name and  its  best  alignment  score
  (percentage mismatch) in a file. When used in this mode, the program
  will include in  the  alignment  the  poor  quality  data  for  each
  reading.  These  files  of names can then be sorted into score order
  and then used for assembly, hence forcing the  readings  that  align
  best to be entered into the database first.  End of new suboption.
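
        Sorting into score order might look like the sketch below.  The
  layout  of  the  saved  file  is  not  described here, so the sketch
  assumes the names and scores have already been read  into  pairs;
  assembly_order is a hypothetical helper.

```python
def assembly_order(scores):
    """Return reading names ordered from lowest to highest mismatch.

    scores is an iterable of (reading_name, percent_mismatch) pairs.
    A lower percentage mismatch is treated as a better alignment.
    """
    return [name for name, _ in sorted(scores, key=lambda item: item[1])]


print(assembly_order([("a.gel", 3.2), ("b.gel", 0.5), ("c.gel", 1.8)]))
```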

        The routine works on single gel readings or processes  batches
  of gel readings accessed through files of file names. It is the only
  way to enter data into the database.

        The function will check the database for  logical  consistency
  and  will only proceed if it is OK. Choose if gel readings should be
  entered into the database, or  if  they  should  only  be  compared.
  Choose  between  using  a file of file names or typing file names on
  the keyboard. If so selected, supply the file of  file  names.  Also
  supply  a  file  of  file  names to contain the names of all the gel
  readings that fail to get entered. Select  the  entry  mode.  Normal
  assembly  is  appropriate  for  all but special cases, as is "permit
  joins". Uses for the other modes are not documented here.  Define  a
  minimum  initial  match length. Define the maximum number of padding
  characters allowed to be used in each gel reading  to  help  achieve
  alignment,  and  the  same  for the number allowed in the contig for
  each gel reading. Finally define the maximum percentage mismatch  to
  be  allowed  for any gel reading to be entered into the database. If
  for any gel reading, any of these last  three  values  is  exceeded
  the gel reading will not be entered into the database.
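
        The three entry limits can be summarised in a small sketch.  The
  default  values  are  those offered in the dialogue for this function;
  may_enter is a hypothetical name and the sketch  covers  only  the
  final test, not the alignment itself.

```python
def may_enter(pads_in_gel, pads_in_contig, percent_mismatch,
              max_pads_gel=8, max_pads_contig=8, max_mismatch=8.0):
    """True if none of the three limits is exceeded."""
    return (pads_in_gel <= max_pads_gel
            and pads_in_contig <= max_pads_contig
            and percent_mismatch <= max_mismatch)


# One pad in the gel, none in the contig, 1.8 percent mismatch.
print(may_enter(pads_in_gel=1, pads_in_contig=0, percent_mismatch=1.8))
```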

        In operation the  function  takes  a  batch  of  gel  readings
  (probably   passed  on   as   a  file  of file names from one of the
  screening routines) and enters them into a database for a sequencing
  project.  It takes each  gel reading in  turn, compares  it with the
  current consensus for the database, and then produces an   alignment
  for   any   regions   of   the   consensus   it overlaps;   if  this
  alignment is sufficiently good  it  then  edits  both  the  new  gel
  reading  and  the  sequences  it  overlaps   and   adds the new  gel
  reading to the database.  The program  then  updates  the  consensus
  accordingly and carries on to the next  gel  reading.

        All alignments are displayed and  any  gel  readings  that  do
  match but  that cannot be aligned sufficiently well have their names
  written to a file of failed gel reading names.  The  function  works
  without   any  user  intervention  and can process any number of gel
  readings in a single run.  Those  gel  readings  that  fail  can  be
  recompared  using  the  same  function  (to find the current overlap
  position) and  the user  can enter them into the database using  the
  "put  all  sequences in new contigs" assembly option and  then  join
  them using "join contigs".

        Typical dialogue and output from the function is shown  below.
  (Note  that  output  for gel readings 2 - 9 has been deleted to save
  space).
  Automatic sequence assembler
  Database is logically consistent
  ? (y/n) (y) Permit entry
  ? (y/n) (y) Use file of file names
  ? File of gel reading names=demo.nam
  ? File for names of failures=demo.fail
  Select entry mode
  X  1 Perform normal shotgun assembly
     2 Put all sequences in one contig
     3 Put all sequences in new contigs
  ? Selection  (1-3) (1) =
  ? (y/n) (y) Permit joins
  ? Minimum initial match (12-4097) (15) =
  ? Maximum pads per gel (0-25) (8) =
  ? Maximum pads per gel in contig (0-25) (8) =
  ? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing           1 in batch
    Gel reading name=HINW.004
    Gel reading length=   283
    Searching for overlaps
    Strand     1
    Strand     2
    No matches found
    Total matches found           1
    Padding in contig=    0 and in gel=    1
    Percentage mismatch after alignment =  1.8
    Best alignment found
           1         11         21         31         41         51
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           ********** ********** ********** ********** ********** **********
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           1         11         21         31         41         51
          61         71         81         91        101        111
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
           ********** ********** ********** ********** ********** ***** ****
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
          61         71         81         91        101        111
         121        131        141        151        161        171
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
           ********** ********** ********** ********** ********** **********
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
         121        131        141        151        161        171
         181        191        201        211        221        231
           GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
           ********** ********** * ******** ********** ********** **********
           GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
         181        191        201        211        221        231
         241        251        261        271        281
           CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
           ****  **** ********** ** * ***** ********** ****
           CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
         241        251        261        271        281
            Batch finished
            9 sequences processed
            0 sequences entered into database
            0 joins made


        Note that "auto assemble" cannot align protein sequences.
 @28. TX 1 @Highlight disagreements

        Used  in  the  latter  stages  of  a  project   to   highlight
  disagreements  between  individual  gel readings and their consensus
  sequences. This display is also  available  in  the  contig  editor.
  Characters  that agree with the consensus are shown as : symbols for
  the plus strand and . for the minus strand. Characters that disagree
  with  the consensus are left unchanged and so stand out clearly. The
  results of this analysis are written to a file.
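
        The marking rule can be sketched for a  single  line  of  the
  display.  The  sketch assumes the reading and consensus lines are the
  same width and that blanks mark positions the  reading  does  not
  cover; highlight is a hypothetical name.

```python
def highlight(reading, consensus, reversed_strand, plus=":", minus="."):
    """Replace characters that agree with the consensus by a symbol.

    Disagreeing characters are kept so that they stand out.
    """
    mark = minus if reversed_strand else plus
    out = []
    for r, c in zip(reading, consensus):
        if r == " ":
            out.append(" ")       # reading does not cover this position
        elif r == c:
            out.append(mark)      # agreement
        else:
            out.append(r)         # disagreement is left unchanged
    return "".join(out)


print(highlight("GCTATT", "G-TATT", reversed_strand=False))
```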

        Before selecting this option create a file of the  display  of
  the  contig to be "highlighted". The option will ask for the name of
  this file. Select symbols to denote "agreeing"  characters  on  each
  strand, the defaults are : and ., but any others can be used. Supply
  the name of a file in which to put the output.

        The display file needed as input for this option is created by
  selecting  "Redirect  output",   followed  immediately  by  "display
  contig", and then "Redirect output" again. The cutoff score used  in
  the  consensus  calculation  can  be  set  by  option  "set  display
  parameters". Note that for the highlight function there is  a  limit
  of  50  for  the  number  of  gel  readings  that are aligned at any
  position - ie the contig must be less than 51 gel readings  deep  at
  its  thickest point. I hope that those performing shotgun sequencing
  never reach this limit, but those using the  program  for  comparing
  sequence families might.

        Typical output from this function is shown below.

                            210       220       230       240       250
      1  HINW.004    :C::::::::::::::::::::::::::::::::::::::::::AC::::
      7  HINW.018    :*::::::::::::::::::::::::::::::::::::::::::CA::::
     -4  HINW.017                                 ...............AC....
                     G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG

                            260       270       280       290       300
      1  HINW.004    ::::::::::::*:D:::::::::::::::::::
      7  HINW.018    ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
     -4  HINW.017    ..............................................A...
      3  HINW.009    :::::::::::::::V::::::::::::::::::::::::::::*AV:::
     -6  HINW.028                            ......................A...
                     AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG

 @32. TX 3 @Extract gel readings

        Used to make copies of the aligned gel readings in a database,
  to write them into separate files, and to write a corresponding file
  of file names. It operates in two modes: either all gel readings are
  extracted, or only those at the ends of contigs.

        Choose which mode of operation is required and supply  a  file
  of file names.

        The gel readings are given their original names.

        If the option is used to extract all the gel readings  from  a
  database,  a  subsequent  run  of "auto assemble" can reconstitute a
  database which has  been  corrupted.  This   rarely  occurs  and  is
  usually  necessitated  by  a  user   employing "alter relationships"
  incorrectly without first having made a copy.
 @1. TX 0 @Help

        Help is available on the following topics :
 @2. TX 0 @Quit

        This command stops the program and is the  only  safe  way  to
  terminate  a run of the program that has altered the contents of the
  database in any way.
 @3. TX 1 @Open a database

        Opens existing databases or allows new ones to be started. The
  function  is automatically called into operation when the program is
  started but can also be selected from the general menu.

        Choose to open an existing database or start a new one, or  if
  !  is  typed  when  the  program is first started, enter the program
  without opening a database. Supply a project database name,  and  if
  it  already exists, the "version". If starting a new database define
  the database size and whether it is for DNA or protein sequences. The
  database  size  is  an  initial  size  for  the  database. It can be
  increased later during the project. It is the sum of the  number  of
  gel readings plus the number of contigs. The current maximum size is
  8000.

        Database names can have from one to 12 letters  and  must  not
  include a full stop (.). The database is made from  five   separate
  files. If the database is called FRED then  version  0  of  database
  FRED  comprises  files  FRED.AR0,  FRED.RL0,  FRED.SQ0, FRED.TG0 and
  FRED.CC0. The version is the last symbol in the  file  names.   Only
  this  program can read these files. If the "copy database" option is
  used it will ask the user to define a new "version".
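
        The naming scheme can be written out as a  short  sketch.  It
  assumes  the version is a single character appended to each of the five
  suffixes shown above; database_files is a hypothetical helper.

```python
SUFFIXES = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names for one version of a database."""
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("name must be 1 to 12 characters with no full stop")
    if len(version) != 1:
        raise ValueError("version is assumed to be a single symbol")
    return [f"{name}.{suffix}{version}" for suffix in SUFFIXES]


print(database_files("FRED", "0"))
```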

        For normal use the maximum gel reading length is  set  to  512
  characters,  but  when  a  database  is  started the user may choose
  lengths of either 512, 1024, 1536..., 4096. Normally the program  is
  used  to handle DNA sequences but many of the functions also work on
  protein sequences. The choice of sequence  type  is  made  when  the
  database is started.

        The contigs are not stored on the disk as the user  sees  them
  displayed  on the screen. Each gel reading is stored with sufficient
  information about how it overlaps other gel  readings  so  that  the
  program  can  work out how to present them aligned on the screen. We
  refer to this extra data as "the relationships" and it is  explained
  below.  The database comprises 5 separate files.
  1.  a working version of each gel reading.  This is the  version  of
  the   gel   reading  that  is in the database and initially it is an
  exact copy of the original sequence (known as the archive) but it is
  edited and manipulated to align  it with other gel readings.
  2.  the file of  relationships.   This  file  contains  all  of  the
  information  that  is required to assemble the working versions into
  contigs during processing;  any manipulations on the data  use  this
  file   and  it  is  automatically  updated  at  any  time  that  the
  relationships are changed.  The  information  in  this  file  is  as
  follows:
  (A) Facts about  each   gel  reading   and   its   relationship   to
  others ("gel descriptor lines"):
  (a) the number of the gel reading   (each gel reading   is  given  a
  number  as  it  is entered into the database)
  (b) the length of the sequence from this gel reading
  (c) the position of the left end of this gel reading    relative  to
  the left end of the contig of which it is a member
  (d) the number of the next gel reading   to the  left  of  this  gel
  reading
  (e) the number of the next gel reading   to the right
  (f) the relative strandedness of this gel reading  , ie whether   it
  is  in the same sense or the complementary sense as its archive.
  (B) Facts about each contig ("contig descriptor lines"):
  (a) the length of this contig
  (b) the number of the leftmost gel reading   of this contig
  (c) the number of the rightmost gel reading   of this contig.
  (C) General facts:
  (a) the number of gel readings in the database
  (b) the number of contigs in the database.
  3.  the file of archive names.  This is simply a list of  the  names
  of each of the archive files in the database.
  4. the file of tags (annotation). This consists of linked  lists  of
  tag  information  for  each  sequence  in  the  database.  Tags  are
  created by the user as annotation, or by xdap as records of edits or
  for  storing  cutoff  information.   As  the number of tags can grow
  without limit, so can this file.  For each gel  there  is  a  header
  record,  which contains the record number of the start of the linked
  list for that gel. On line  IDBSIZ  there  is  a  record  containing
  information  about  the file such as its present length and if there
  are any free "tag" slots to be reused in the file.
  5. the file of comments (annotation).  This consists of linked lists
  of comment fragments.  Comments are created by the user as a message
  attached  to  annotation,  or  by  the  system  to   store   cutoff
  information.   Comments  are  character  strings  of  any   length.
  Comments longer than 40 characters are broken up into fragments, each
  40 characters long, and are chained together in a linked list.  As the
  number of comments can grow without limit, so can this file.

        Structure of the database files

        1.  The file of relationships

        The file contains IDBSIZ lines of data:  the general data  are
  stored  on line IDBSIZ;   data about  gel readings  are stored  from
  line 1 downwards;  data about contigs are stored from line  IDBSIZ-1
  upwards.  A  database  of 500 lines containing 25 gel readings and 4
  contigs would have a file of relationships as is shown below.


                    ---------------------------------------------
                       0  Info about the database size
                       1  Gel descriptor record
                       2   "      "       "
                       3   "      "       "
                       4   "      "       "
                       5   "      "       "
                       '   '      '       '
                       '   '      '       '
                      25   "      "       "
                      26  Empty record
                       '    '     '
                       '    '     '
                     495    '     '
                     496  Contig descriptor record
                     497    "        "        "
                     498    "        "        "
                     499    "        "        "
                     500   Number of gel readings=25, Number of contigs=4
                    ---------------------------------------------

            The arrangement of the data in the file of relationships

  As each new gel reading   is added into the database a new  line  is
  added to  the  end  of  the  list  of gel descriptor lines.  If this
  new gel  reading  does not overlap with any gel readings already  in
  the  database  a new contig  line  is added  to  the top of the list
  of contig lines.  If it overlaps with one contig then no new  contig
  line  need  be  added  but  if it  overlaps with  two  contigs  then
  these  two  contigs must be joined and the number  of  contig  lines
  will  be reduced by one. Then the list of contig lines is compressed
  to  leave  the empty line at the top of the list.  Initially the two
  types  of  line will move towards  one  another  but eventually,  as
  contigs  are joined, the contig descriptor lines will  move  in  the
  same  direction  as the  gel descriptor lines.   At  the  end  of  a
  project  there should  be only one contig  line.   A database of IDBSIZ
  lines can thus hold a project of up to IDBSIZ-2 gel readings (498 for
  the 500 line example above).
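
        The bookkeeping described above can be sketched as follows.
  This is a toy model of line usage only, written for this help text:
  the class and method names are hypothetical and no descriptor
  contents are stored.

```python
class RelationshipsLayout:
    """Toy model of line usage only; descriptor contents are not stored."""

    def __init__(self, idbsiz):
        self.idbsiz = idbsiz  # total lines; line idbsiz holds the two counts
        self.ngels = 0        # gel lines occupy 1 .. ngels
        self.ncontigs = 0     # contig lines occupy idbsiz-ncontigs .. idbsiz-1

    def free_lines(self):
        return self.idbsiz - 1 - self.ngels - self.ncontigs

    def contig_lines(self):
        return list(range(self.idbsiz - self.ncontigs, self.idbsiz))

    def add_reading(self, overlaps):
        """Enter a reading that overlaps 0, 1 or 2 existing contigs."""
        if overlaps not in (0, 1, 2) or overlaps > self.ncontigs:
            raise ValueError("overlaps must be 0, 1 or 2 existing contigs")
        if self.free_lines() < (2 if overlaps == 0 else 1):
            raise RuntimeError("database is full")
        self.ngels += 1
        if overlaps == 0:
            self.ncontigs += 1   # new contig line at the top of the list
        elif overlaps == 2:
            self.ncontigs -= 1   # join: the contig list shrinks by one line
        return self.ngels        # line number of the new gel descriptor


# Reproduce the example: 500 lines, 25 gel readings, 4 contigs.
db = RelationshipsLayout(500)
for overlaps in [0] * 4 + [1] * 21:
    db.add_reading(overlaps)
print(db.ngels, db.contig_lines(), db.free_lines())
```

  With these inputs the contig lines are 496 to 499 and lines 26 to 495
  are empty, as in the diagram.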

        2.  Structure of the working versions file

        The working versions of gel readings are stored  in   a   file
  of  NGELS  lines  each  containing  MAXGEL  characters.  Gel reading
  number 1 is stored on line 1, gel reading number  2 on line 2 and so
  on.  NGELS  is the current number of readings and MAXGEL the maximum
  reading length.

        3.  Structure of the archive names file

        This file has NGELS lines of 16 characters.

        4.  Structure of the tag file

        This file initially starts with IDBSIZ lines, and is  expanded
  as  new tags are created.  Information about the length of the file,
  and which tag records are reusable is  stored  on  line  IDBSIZ.   A
  database of 500 lines would have a file of tags as shown below.

                    ---------------------------------------------
                       1  Tag descriptor record
                       2   "      "       "
                       3   "      "       "
                       4   "      "       "
                       5   "      "       "
                       '   '      '       '
                       '   '      '       '
                     497   "      "       "
                     498   "      "       "
                     499   "      "       "
                     500   Length of file=N, Free list=0
                     501  Tag record
                     502   "   "
                     503   "   "
                       '   '   '
                       '   '   '
                     N-2   "   "
                     N-1   "   "
                       N  Tag record
                    ---------------------------------------------

            The arrangement of the data in the tag file

  As each new tag is added to the database, a check  is  made  in  the
  file  descriptor  record  at  line  IDBSIZ.  If the list of reusable
  records is 0, the file is extended by one line.  Otherwise  the  new
  tag  is  assigned  to the record at the head of the free list.  When tags
  are deleted, they are added to the free list in the file  descriptor
  record.
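
        A minimal sketch of this allocation scheme is given below. How
  freed records are linked together is not stated here, so the sketch
  assumes each freed record is pushed onto the head of the free list;
  the class and its names are hypothetical.

```python
class TagFile:
    """Record allocation only; tag contents are not modelled."""

    def __init__(self, idbsiz):
        self.idbsiz = idbsiz
        self.length = idbsiz   # "Length of file", kept on line idbsiz
        self.free_head = 0     # "Free list"; 0 means nothing to reuse
        self.next_free = {}    # freed record -> next freed record (assumed)

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free_head == 0:
            self.length += 1               # extend the file by one line
            return self.length
        record = self.free_head            # reuse the head of the free list
        self.free_head = self.next_free.pop(record)
        return record

    def release(self, record):
        """Put the record of a deleted tag on the free list."""
        self.next_free[record] = self.free_head
        self.free_head = record


tags = TagFile(500)
first = tags.allocate()
second = tags.allocate()
tags.release(first)
reused = tags.allocate()
print(first, second, reused, tags.length)
```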

        5.  Structure of the comment file

        This file initially starts with 1 line, and is expanded as new
  annotation  is  created.   Information about the length of the file,
  and which comment records are reusable is stored on the first line.

                    ---------------------------------------------
                       1  Length of file=N, Free list=0
                       2  Comment fragment
                       3   "       "
                       4   "       "
                       '   '       '
                       '   '       '
                     N-2   "       "
                     N-1   "       "
                       N  Comment fragment
                    ---------------------------------------------

            The arrangement of the data in the comment file

  As each new comment is added to the database, a check is made in the
  file descriptor record at line 1. If the list of reusable records is
  0, the file is extended to hold the new comment. Otherwise  the  new
  comment  is  assigned  to  records  starting  with  the head of the
  free list.  When comments are  deleted,  the  discarded  records  are
  added to the free list in the file descriptor record.
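
        The splitting of a long comment into chained 40 character
  fragments can be sketched as below. The record layout (a fragment
  plus the number of the next record, with 0 ending the chain) and the
  use of consecutive record numbers are assumptions for illustration;
  the last fragment is simply left short here.

```python
FRAGMENT = 40  # fragment length given in the text


def split_comment(text, size=FRAGMENT):
    """Break a comment into fragments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]


def chain(fragments, first_record):
    """Store fragments in consecutive records; 0 marks the end of a chain."""
    records = {}
    for offset, fragment in enumerate(fragments):
        number = first_record + offset
        following = number + 1 if offset + 1 < len(fragments) else 0
        records[number] = (fragment, following)
    return records


def read_comment(records, head):
    """Follow the chain from `head` and rebuild the comment."""
    parts = []
    while head != 0:
        fragment, head = records[head]
        parts.append(fragment)
    return "".join(parts)


comment = "x" * 100
records = chain(split_comment(comment), first_record=2)
print(sorted(records), read_comment(records, 2) == comment)
```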

        There  are  various  checks  within  the  programs  to protect
  users from themselves:-
  1.  All user input is checked for  errors  -  e.g.    reference   to
  non-existent  gel readings or  contigs,  incorrect  positions in the
  contig or gel readings.
  2.  Before entering a gel reading the system checks to see if a file
  of the same name has already been entered.
  3.  Join will not allow the circularising of a contig.
  4. Users may escape from any point in the program.
  5. Help is available from all points in the program.


  IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
  ANYTHING  THAT  INVOLVES  CHANGING THE CONTENTS OF THE DATABASE, I.E.
  DURING AUTO ASSEMBLE, COMPLETE JOIN, COMPLEMENT  CONTIG,  SAVE  EDIT
  CONTIG.   This  could  corrupt  the  database  so  badly  that it is
  impossible to fix. The program should always be left using the  QUIT
  option.
 @4. TX 3 @Edit contig

        The Contig Editor is a mouse-driven editor  that  can  insert,
  delete and change gel reading sequences.

        The Contig Editor allows scrolling from one end of a contig to
  the  other  using the scroll bar and scroll buttons. Action of mouse
  button presses when the mouse pointer is in the scroll bar:

      Middle Mouse Button      Set editor position
      Left   Mouse Button      Scroll forward one screenful
      Right  Mouse Button      Scroll backwards one screenful

  The four scroll buttons operate as follows:

      "<<"                     Scroll left half a screenful
      "<"                      Scroll left one character
      ">"                      Scroll right one character
      ">>"                     Scroll right half a screenful

        The Editor cursor can  be  positioned  anywhere  in  the  edit
  window  by  moving the mouse pointer over the character of interest,
  then pressing the left mouse button. The Editor cursor can  also  be
  moved by using the direction arrow keys.

        The editor operates in two  main  edit  modes  -  Replace  and
  Insert. Replace allows a character to be replaced by another. Insert
  allows characters to  be  inserted  into  a  gel  reading  sequence.
  Characters  are entered by typing them from the keyboard. Only valid
  characters are permitted.  Characters can be deleted by  positioning
  the cursor one character to the right, then pressing the delete key.
  Normally Insert and Delete apply to the consensus line of the contig
  ONLY.  This  restraint  can  be overridden by using the "Super Edit"
  mode of operation, THOUGH IT IS NOT RECOMMENDED.

        Edits can also be performed on the consensus, though they  are
  restricted  to  insertion  and deletion of padding characters ("*").
  These edits also have special meanings.  A deletion will delete  ALL
  characters  at the position to the left of the cursor in the contig,
  and move the relative positions of all  sequences  starting  to  the
  right  of the cursor position left one character.  An insertion will
  insert the character typed ("*") into ALL gel reading  sequences  at
  the  cursor's position in the contig, and move the relative positions
  of all sequences starting to the right of the cursor position  right
  one character.
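
        The effect of these two consensus edits on the gel readings can
  be sketched as follows. Readings are modelled as a left end position
  and a sequence string; the treatment of a reading that starts
  exactly at the edited column is an assumption, and the function
  names are hypothetical.

```python
def insert_pad(readings, cursor):
    """Insert "*" at contig position `cursor` in every reading covering it."""
    for r in readings:
        start, end = r["pos"], r["pos"] + len(r["seq"]) - 1
        if start > cursor:
            r["pos"] += 1                      # starts to the right: shift right
        elif end >= cursor:
            i = cursor - start
            r["seq"] = r["seq"][:i] + "*" + r["seq"][i:]


def delete_left_of(readings, cursor):
    """Delete the column to the left of `cursor` from every reading."""
    column = cursor - 1
    for r in readings:
        start, end = r["pos"], r["pos"] + len(r["seq"]) - 1
        if start > column:
            r["pos"] -= 1                      # starts to the right: shift left
        elif end >= column:
            i = column - start
            r["seq"] = r["seq"][:i] + r["seq"][i + 1:]


readings = [{"pos": 1, "seq": "ACGT"},
            {"pos": 3, "seq": "GTAA"},
            {"pos": 6, "seq": "CC"}]
insert_pad(readings, cursor=4)
padded = [(r["pos"], r["seq"]) for r in readings]
delete_left_of(readings, cursor=5)
restored = [(r["pos"], r["seq"]) for r in readings]
print(padded, restored)
```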

        The effect of the last edit can  be  undone  by  pressing  the
  "Undo" button at the top of the editor window.

        The cursor  will  automatically  be  positioned  at  the  next
  problem  when  the  "Find Next Problem" button is selected. The next
  problem is where the consensus shows either an ambiguity ("-") or  a
  pad ("*") character.

        The edits to the contig can be saved by  pressing  the  "Leave
  Editor"  button and replying "Yes" to the prompt to "Save changes?".
  As no changes are made to the working copy of your database until this
  point it is possible to abort the editor if the edit session ends up
  in an unsatisfactory state (ie if you've stuffed it up!)


 Displaying Traces

        The original data from which the gel reading  sequences  were
  derived  can  be seen by double clicking (two quick clicks) with the
  middle mouse button on the area  of  interest.  The  trace  will  be
  displayed  with  the  point  clicked  at  the  centre  of  the trace
  viewport.

        All traces that are displayed are maintained  in  one  window,
  called  the  Trace Manager. The Trace Manager displays at most four
  traces. When four traces are already being managed and a new
  one is requested, the one at the top of the Trace Manager is removed
  and the new one is added to  the  bottom.   Traces  can  be  removed
  individually  by  using  the  "quit" button in the panel next to the
  trace.


 Extending Reads Using Cutoff Information

        Sequence data read from the trace files of automated fluorescent
  sequencing machines and processed through the program ted will have the
  discarded sequence (vector at start and poor read at end)  available
  to  the  contig editor. To display the cutoff information, press the
  "Display Cutoff" button at the top of the editor window.  The cutoff
  sequence appears in grey. This sequence can be incorporated into the
  editable sequence, by moving the cutoff position. This  is  done  by
  positioning  the  cursor  at  the end of the gel sequence, and using
  Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of  cutoff.
  The Meta key is a diamond on the Sun keyboard.


 Pop-up menu

        A pop-up menu is revealed by depressing the "Control"  key  on
  the  keyboard  and  at the same time pressing the left mouse button.
  The menu has the following functions:

      Search
      Highlight Disagreements
      Save Contig
      Create Tag
      Edit Tag
      Delete Tag
      Select Oligo

  "Highlight Disagreements" simply toggles between the normal display
  showing  the  current  base  assignments and one in which only those
  assignments that differ from the consensus are shown.
  "Save Contig" is described above.  Searching and operations on  tags
  are described below.


  Searching

        Selecting "Search" brings up a window which can remain present
  during normal editor operation. The window allows the user to select
  the direction of search, the type of search and a  value  to  search
  on.   The value is entered into the value text window. Then pressing
  the "search" button performs the search. If successful,  the  cursor
  is  positioned  and  centred  accordingly. An audible tone indicates
  failure.  Pressing the "ok" button removes the  search  window.  The
  search  window  is  automatically  removed when the contig editor is
  exited.

  There are seven different search modes:

  1. Search by position

  This positions the cursor at the numeric position specified  in  the
  value  text  window.  Eg  a  value of "1234" causes the cursor to be
  placed at base number 1234 in the contig. Positioning within a  gel
  reading  is achieved by prefixing the number with the "@" character,
  eg "@123" positions the cursor at base 123 of the sequence in  which
  the  cursor  lies.  Relative positions can be specified by prefixing
  the number with a plus or minus character. Eg "+1234"  will  advance
  the  cursor 1234 bases. If possible, the cursor is positioned within
  the same sequence.  The direction buttons  have  no  effect  on  the
  operation of "search by position".
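
  The three forms of value accepted by "search by position" can be
  sketched as a small parser. The function and its arguments are
  hypothetical, and it assumes that bases within a reading are counted
  from the left end of the reading as displayed.

```python
def resolve_position(value, cursor, reading_start=None):
    """Turn a "search by position" value into a contig position.

    cursor        -- current contig position of the cursor
    reading_start -- contig position of the left end of the reading
                     under the cursor, or None if there is none
    """
    value = value.strip()
    if value.startswith("@"):                 # position within the reading
        if reading_start is None:
            raise ValueError("cursor is not inside a gel reading")
        return reading_start + int(value[1:]) - 1
    if value.startswith(("+", "-")):          # relative move
        return cursor + int(value)
    return int(value)                         # absolute contig position


print(resolve_position("1234", cursor=10))
print(resolve_position("+1234", cursor=10))
print(resolve_position("@123", cursor=500, reading_start=400))
```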

  2. Search by reading name

  This positions the cursor  at  the  left  end  of  the  gel  reading
  specified  in the value text window. If the value is prefixed with a
  slash it is assumed to be  a  gel  reading  name.  Otherwise  it  is
  assumed to be a gel reading number. Eg "123" positions the cursor at
  the left end of gel reading number 123.  "/a16a12.s1"  positions  at
  the  start  of reading a16a12.s1. If the value was "/a16" the cursor
  is positioned at the first reading which  starts  with  "a16".   The
  direction  buttons  have  no  effect  on the operation of "search by
  reading name".

  3. Search by tag type.

  This positions the cursor at the start of the next tag which has the
  same  type  as  specified by the type value menu. To change the
  type, select from the menu that pops up when the mouse is clicked  on
  the  button  labeled  "Type:".  The  search  can be performed either
  forwards or backwards from the current cursor position. To find  all
  tags, use "search by annotation", with a null text value string.

  4. Search by annotation.

  This positions the cursor at the start of the next tag which  has  a
  comment  containing  the  string specified in the value text window.
  The search performed is a regular  expression  search,  and  certain
  characters  have  special meaning. Be careful when your value string
  contains ".", "*", "[", "^" or "$".  The  search  can  be  performed
  either forwards or backwards from the current cursor position.
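
  The effect of the special characters can be illustrated with
  Python's own regular expressions, which are used here only as a
  stand-in: the editor's regular expression syntax may differ in
  detail. The comments searched are made-up examples.

```python
import re


def find_annotation(comments, text, literal=True):
    """Return the comments that contain `text`.

    With literal=True the special characters are escaped first, so that
    "." matches only a full stop rather than any character.
    """
    pattern = re.escape(text) if literal else text
    return [c for c in comments if re.search(pattern, c)]


comments = ["template=a16a9.s1", "template=a16a9xs1"]
print(find_annotation(comments, "a9.s1"))                 # literal full stop
print(find_annotation(comments, "a9.s1", literal=False))  # "." matches anything
```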

  5. Search by sequence.

  This positions the cursor at the start of the next piece of sequence
  that  matches  the  value  specified  in  the text value window. The
  search is for an exact match, which means the case of the value string
  is   important.   The  search  is  performed  on  the  gel  readings
  themselves, rather than the consensus sequence. The  search  can  be
  performed  either  forwards  or  backwards  from  the current cursor
  position.

  6. Search by problem.

  This positions the  cursor  at  the  next  place  in  the  consensus
  sequence  which  is  not  an "A", "C", "G" or "T". The search can be
  performed either forwards  or  backwards  from  the  current  cursor
  position.
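
  A minimal sketch of this search, assuming positions are counted
  from 1 and that the search begins at the base next to the cursor
  rather than at the cursor itself:

```python
def next_problem(consensus, cursor, forwards=True):
    """Return the next position that is not A, C, G or T, or None."""
    step = 1 if forwards else -1
    position = cursor + step
    while 1 <= position <= len(consensus):
        if consensus[position - 1] not in "ACGT":
            return position
        position += step
    return None


consensus = "ACG-T*A"
print(next_problem(consensus, 1))
print(next_problem(consensus, 6, forwards=False))
```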

  7. Search by quality

  This positions the  cursor  at  the  next  place  in  the  consensus
  sequence  where the consensus calculations for the two strands disagree.
  When sequence is present on only one strand, the search  will  stop
  at  every  base.  The  search  can  be  performed either forwards or
  backwards from the current cursor position.


 Annotation

        Parts of a sequence can be annotated, to record the  positions
  of  primers used for walking, or to mark sites, such as compressions
  that have caused problems during sequencing.  The consensus sequence
  CANNOT be annotated.

        To annotate a piece of  sequence  first  select  the  part  of
  sequence  using  the  mouse  buttons.  Use  the left mouse button to
  position the start of the selection, and while this button is  being
  held  down, move the mouse to extend.  The selection can be extended
  further using the right mouse button.

        To create annotation, invoke the pop-up menu, and  select  the
  "Create Tag" function. A small "tag editor" will appear which allows
  you to select the type of the annotation from a pull-down menu,  and
  specify  a  comment  if desired.  To select a new type pull down the
  Type menu, and select the entry desired.  To enter a comment, simply
  type  into  the  text  window  in the tag editor.  The annotation is
  created when the "Leave" button on the tag editor is pressed, and is displayed
  in the colour defined in the tag database file (TAGDB).

        To edit existing annotation, position the cursor with the left
  mouse  button  on  the tag, and select "Edit Tag" from the pop-up
  menu.  This invokes the tag editor, and  changes  to  the  type  and
  comment  of  the annotation can be made. The tag is updated when the
  "Leave" button is pressed.

        To delete an existing annotation, position the cursor with the
  left  mouse  button  on the tag, and select "Delete Tag" from the
  pop-up menu.


 NOTE:

        As the Contig Editor is a very powerful tool, it  is  possible
  that  the  alignment  of  the gel reading sequences has unexpectedly
  been disrupted.  This can easily happen to parts of the contig  that
  lie to the right of the screen if excessive use has been made of the
  "Super Edit" facility.  Until familiar with "Super  Edit"  it  would
  benefit  the  sequencer  to  quickly  scan  through the contig after
  editing to check that bad alignments have not been created.

  Selecting Oligos
  ----------------

  1. Open the oligo selection window, by selecting "Select Oligo" from
  the contig editor popup menu.
  2. Position the cursor to where you want the  oligo  to  be  chosen.
  While  the  oligo  selection  window is visible, you will still have
  complete control over positioning  and  editing  within  the  contig
  editor.
  3. Indicate the strand for which you require an oligo. This is  done
  by   toggling  the  direction  arrow  ("----->"  or  "<------"),  if
  necessary.
  4. Press the "Find Oligos" button to find all suitable  oligos  (See
  "Oligo  selection" below.)  Information for the closest oligo to the
  cursor position is given in the output text window.  In  the  contig
  editor the position of the oligo is marked by a temporary tag on the
  consensus. The window is recentered if the oligo is off the  screen.
  Selecting  "Display Selection Information" will print a short report
  on the numbers  of  oligos  considered  and  rejected  during  oligo
  selection.
  5. If this oligo is  not  suitable  (it  may  have  been  previously
  chosen,  and  found  to  be unsuitable by experimentation, say), the
  next closest oligo can be viewed by pressing "Select Next".
  6. Suitable templates are automatically identified for the currently
  displayed  oligo  (See  "Template selection" below.) By default, the
  template is that closest to the oligo site. If  the  choice  is  not
  suitable  (it  may  be  known  to  be  a poor quality template, say)
  another can be chosen from the  "Choose  Template  for  this  Oligo"
  menu.   Templates that do not appear on the menu can be specified by
  selecting "other". However, the template  must  be  on  the  correct
  strand and be upstream of the oligo.
  7. A tag can be created for the current oligo by pressing the button
  "Create a tag for this oligo". The annotation for this tag holds the
  name of the template and  the  oligo  primer  sequence.   There  are
  fields   to  allow  the  user  to  specify  their  own  primer  name
  ("serial#") and comments ("flags") for this tag. An example of oligo
  tag annotation:
          serial#=
          template=a16a9.s1
          sequence=CGTTATGACCTATATTTTGTATG
          flags=

  8. The oligo selection window is closed when "Create a tag for  this
  oligo" or "Quit" is selected.
  Oligo selection:
  ----------------
  The oligo selection engine is the one used in the program OSP. It is
  described in some detail in:
  Hillier,  L.,  and  Green,  P.  (1991).  "OSP:  an   oligonucleotide
  selection program," PCR Methods and Applications, 1:124-128.
  The parameters controlling the selection of oligos can be changed in
  the "Oligo Selection Parameters" window. The weights controlling the
  scoring of selected oligos can be changed in  the  "Oligo  Selection
  Weights" window.
  By default, the oligos are selected from a window  that  extends  40
  bases  either  side  of  the  cursor.  The size and location of this
  window relative to  the  cursor  position  can  be  changed  in  the
  "Parameters" window.
  In xbap oligos are ranked according to their proximity to the cursor
  position, rather than by their scores.
  Template selection:
  -------------------
  For simplicity, each reading is considered to represent a  template.
  In  practice,  many  readings  can  be  made  of  the same template.
  Suitable templates that are identified are those that:

      1. are in the appropriate sense,
      2. have 5' ends that start upstream of the oligo,
  and 3. are sufficiently close to the oligo to be useful.

  This last criterion relates to the insert  size  for  the  subclones
  used  for  sequencing  and the average reading length. A template is
  considered useful if a full reading can be made from it, taking into
  account  both  of  these  factors.  The  default insert size is 1000
  bases, and the default average reading length is  400  bases.  These
  values can be changed in the "Parameters" window.
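
  The three criteria can be sketched as a filter over the readings.
  The exact test used for "sufficiently close" is not given here; the
  sketch assumes that a full reading starting at the oligo must end
  within one insert length of the 5' end of the template. The function
  name, the tuple layout and the readings are all made up for
  illustration.

```python
def suitable_templates(readings, oligo_pos, sense,
                       insert_size=1000, read_length=400):
    """Return names of usable templates, closest to the oligo first.

    readings -- tuples (name, left, right, strand), strand "+" or "-"
    sense    -- strand on which the oligo is required
    """
    found = []
    for name, left, right, strand in readings:
        if strand != sense:                      # 1. appropriate sense
            continue
        if strand == "+":
            distance = oligo_pos - left          # 5' end is the left end
        else:
            distance = right - oligo_pos         # 5' end is the right end
        # 2. 5' end upstream of the oligo; 3. close enough (assumed rule)
        if 0 < distance <= insert_size - read_length:
            found.append((distance, name))
    return [name for distance, name in sorted(found)]


readings = [("a", 1, 300, "+"), ("b", 200, 500, "+"),
            ("c", 100, 400, "-"), ("d", 900, 1200, "+")]
print(suitable_templates(readings, oligo_pos=450, sense="+"))
```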
 @5. TX 1 @Display a contig

        Used to show the aligned  gel  readings  for  any  part  of  a
  contig.  The  number,  name  and strandedness of each gel reading is
  shown and the consensus is written below.

        If required identify the contig,  and then the start  and  end
  points of the region to display.

        The display can be directed  to  a  disk  file  using  "direct
  output to disk".

        Below is an example showing the left  end  of  a  contig  from
  position   1  to  200.  Overlapping this region are gels 6, 3, 5, 17 and
  12; 6, 3 and 5 are in reverse orientation to their archives (denoted
  by  a  minus   sign).  There  are  a  few uncertainty codes and a few
  padding characters  in  the  working  versions,  but  the  consensus
  (shown  below  each page width) has a definite assignment for almost
  every position.

                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
 @6. TX 1 @List a text file

        This option allows users to list text files on the screen.  It
  can  be  used  to  read  a file containing notes, for checking files
  written to disk etc. The user is asked to type the name of the  file
  to list.
 @8. TX 1 @Calculate a consensus

        Calculates  a  consensus  sequence   either  for   the   whole
  database or for selected contigs. The consensus is written to a file
  named by the user.
  Supply a file name,  choose  between   whole  database  or  selected
  contigs.

        Symbols for uncertainty in gel readings

        In  order  to  record  uncertainties  when  reading  gels  the
  codes  shown  below can  be  used. Use  of these codes permits us to
  extract the maximum amount of data from each gel and yet record  any
  doubts   by  choice   of   code.    The program can deal with all of
  these codes and any other  characters  in  a  sequence  are  treated
  as  dash  (-) characters.

         SYMBOL                  MEANING

           1             PROBABLY        C
           2                "            T
           3                "            A
           4                "            G
           D                "            C       POSSIBLY        CC
           V                "            T          "            TT
           B                "            A          "            AA
           H                "            G          "            GG
           K                "            C          "            C-
           L                "            T          "            T-
           M                "            A          "            A-
           N                "            G          "            G-
           R             A OR G
           Y             C OR T
           5             A OR C
           6             G OR T
           7             A OR T
           8             G OR C
           -             A OR G OR C OR T
           a             A
           c             C
           g             G
           t             T
           *             padding character placed by auto assembler
            else = -

  The DNA consensus algorithm

        The "calculate  consensus"  function,  the  "display   contig"
  routine and the "examine quality" option use  the rules  outlined  here
  to  calculate  a consensus  from aligned gel  readings.   Note  that
  "display  contig"  calculates a consensus for  each  page  width  it
  displays  (it  does  not use the consensus sequence file  calculated
  by the consensus function).

        We  have  6  possible  symbols  in  the  consensus   sequence:
  A,C,G,T,*  and -. The last symbol is assigned if none of the others
  makes up a sufficient proportion of the aligned  characters  at  any
  position  in the contig. The following calculation is used to decide
  which symbol to place in the consensus at each position.

        Each uncertainty code contributes a score to one of  A,C,G,T,*
  and  also  to  the  total  at each point. Symbols like R and Y which
  don't correspond to a single base type contribute only to the  total
  at each point. The scores are shown below.
                definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1

                probable assignments ie 1,2,3,4 = 0.75

                other uncertainty codes including R,Y,5,6,7,8,- = 0.1

        A cutoff score of 51% to 100% is supplied by the  user.  (When
  the   program   starts   this  is  set  to  75%.  See  "set  display
  parameters").  At each position in the contig we calculate the total
  score  for  each of the 5 symbols A,C,G,T and * (denote these by Xi,
  where i=A,C,G,T or *), and also the total score contributed by all
  characters at that position, including codes such as R and Y (denote
  this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
  in the consensus; otherwise - is assigned.

        Notice that S does not equal the number of times the  sequence
  has  been  determined, but is the score total, and hence we are less
  likely to put a -  in  the  consensus.  For  the  "examine  quality"
  algorithm  each  strand is treated separately but the calculation is
  the same. (It was originally different).
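
        The calculation for a single position can be sketched as
  follows. The scores and the symbol table are those given above; the
  function name is hypothetical, and S is taken to include the 0.1
  contributed by codes that do not name a single base, which is one
  reading of the description.

```python
# Codes scoring 1 towards a single symbol.
DEFINITE = {**{b: b for b in "ACGT*"},
            **{b.lower(): b for b in "ACGT"},
            "D": "C", "V": "T", "B": "A", "H": "G",
            "K": "C", "L": "T", "M": "A", "N": "G"}
# Codes scoring 0.75 towards a single symbol.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for the characters aligned at one position."""
    scores = dict.fromkeys("ACGT*", 0.0)
    total = 0.0                                  # S in the text
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:                                    # R, Y, 5-8, - and anything else
            total += 0.1
    if total == 0:
        return "-"
    for symbol, score in scores.items():
        if 100.0 * score / total > cutoff:
            return symbol
    return "-"


print(consensus_symbol(["3", "A"]), consensus_symbol(["*", "C"]))
```

  The two columns used here are taken from positions 123 and 128 of
  the "display a contig" example, where the consensus shows A and -
  respectively.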

        Format of the consensus sequence (and vector sequences).

        A consensus  sequence  file  may  contain  the  consensus  for
  several contigs and so we identify each of them by preceding them by
  a 20 character title. The title is of the form  <---LAMBDA.0076---->
  (  where LAMBDA is the project name and gel reading number 76 is the
  leftmost gel reading to contribute to  this   consensus   sequence).
  The   angle  brackets  <>  and the 4 digit number preceded by a . are
  important to some processing programs.
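
        A processing program might read such a title as sketched below.
  The pattern is inferred from the single example above and assumes
  that project names contain no dashes, full stops or angle brackets;
  the function name is hypothetical.

```python
import re

# "<", dashes, project name, ".", four digits, dashes, ">"
TITLE = re.compile(r"<-*([^-.<>]+)\.(\d{4})-*>")


def parse_title(title):
    """Return (project name, leftmost gel reading number) from a title."""
    match = TITLE.fullmatch(title)
    if len(title) != 20 or match is None:
        raise ValueError("not a 20 character consensus title")
    return match.group(1), int(match.group(2))


print(parse_title("<---LAMBDA.0076---->"))
```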
 @25. TX 1 @Show relationships

        Used to show the relationships of  the  gel  readings  in  the
  database in three ways -
  (a) All contig descriptor lines  followed  by  all  gel   descriptor
  lines.
  (b) All contigs one after the   other   sorted,   i.e.    for   each
  contig   show  its   contig  descriptor line followed by all its gel
  descriptor lines sorted on position from left to right
  (c) Selected contigs:  show the contig  line  and,  in  order, those
  gel  readings  that  cover  a  user-defined  region.  Note that this
  output can be  directed  to  a  disk  file  by  prior  selection  of
  "redirect output".

        Below is an example showing a contig from position 1  to  689.
  The leftmost gel reading is number 6 and has archive name HINW.010, the
  rightmost gel reading is number 2 and has archive name HINW.004.
  On  each  gel  descriptor  line  is  shown:  the name of the archive
  version, the gel number, the position of the left  end  of  the  gel
  reading  relative to the left  end  of  the  contig,  the length  of
  the gel reading  (if this is negative it means that the gel  reading
  is  in  the  opposite orientation to its archive), the number of the
  gel reading   to the left and the number of the gel reading  to  the
  right.


   CONTIG LINES
   CONTIG      LINE  LENGTH               ENDS
                                       LEFT   RIGHT
                 48     689               6       2
   GEL LINES
   NAME      NUMBER POSITION LENGTH     NEIGHBOURS
                                       LEFT   RIGHT
   HINW.010       6        1   -279       0       3
   HINW.007       3       91   -265       6       5
   HINW.009       5      137   -299       3      17
   HINW.999      17      140    273       5      12
   HINW.017      12      193    265      17      18
   HINW.031      18      385   -245      12       2
   HINW.004       2      401   -289      18       0

 @23. TX 3 @Complement a contig

        This function will complement  and  reverse  all  of  the  gel
  readings   in    a  contig.     It    automatically   reverses   and
  complements  each  gel reading sequence,  reorders  left  and  right
  neighbours,   recalculates   relative  positions  and  changes  each
  strandedness.

        The only user  input  required  is  to  identify  the   contig
  to complement  by  the  number or name of a gel reading it contains.
  DO NOT KILL THE PROGRAM DURING THIS STEP!
 @22. TX 3 @          Join contigs

        This function joins contigs interactively using a mouse driven
  editor.   The operation of this editor is very similar to the Contig
  Editor described in "Edit".

        It allows the user to align the ends of the two contigs by
  editing each contig separately.  It is important that the alignment
  achieved is correct because once the join is completed the
  alignment is fixed.

        First identify the two contigs that are to be joined. The
  program checks that the two contig numbers are different (it will
  not allow circles to be formed!)

        The Join Editor consists of  two  Contig  Editors  in  between
  which  is sandwiched a disagreement box. This disagreement box shows
  exclamation marks to denote mismatches between the two consensuses.

        For example, the display will look something like this:

                           1460      1470      1480      1490      1500
     56  HINW.100    TCT*GAGCAGTGTGGGCGCTG*CCGG
     33  HINW.300    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
    -25  HINW.090    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
     19  HINW.123    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
         MISMATCH                         !                      !!!!!!
                             10        20        30        40        50
     -6  HINW.010    TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
     -3  HINW.007                TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
     -5  HINW.009                              GCTCGGAGGGCATGAAGT*AGAGC
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
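
        The contents of the disagreement box can be reproduced from
  the two consensuses as sketched below. The function names are
  hypothetical, and the percentage shown (mismatches divided by the
  length of the overlap) is an assumption about how the program
  calculates its figure.

```python
def mismatch_line(upper, lower):
    """Return '!' where the two aligned consensuses differ, else ' '."""
    overlap = min(len(upper), len(lower))
    return "".join("!" if upper[i] != lower[i] else " "
                   for i in range(overlap))


def percent_mismatch(upper, lower):
    """Percentage of overlapping positions that disagree (assumed rule)."""
    line = mismatch_line(upper, lower)
    return 100.0 * line.count("!") / len(line) if line else 0.0


upper = "TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG"
lower = "TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC"
print(mismatch_line(upper, lower))
print(percent_mismatch(upper, lower))
```

  For the display above this marks position 22 and positions 45 to 50,
  matching the MISMATCH line shown.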


        The overlap must be of at least one character.  Use the scroll
  bar and the scroll buttons (`<<', `<', `>' and `>>') to adjust the
  relative positions of the two contigs.

        The join position can be fixed by pressing the `lock' button
  at the top of the Join Editor.  Locking allows the two contigs to be
  scrolled as one when using the scroll bar and buttons, with the left
  ends always in the same position relative to each other.

        Once locked, it is best to proceed  to  the  right  along  the
  contigs,  inserting padding characters (`*') into the consensuses to
  minimise the disagreements.

        It  is  essential  that  the  user  aligns  the  two   contigs
  throughout  the  whole  region of overlap before completing the join
  because it is only at this stage that the two contigs can be  edited
  independently.  Once the join is completed the alignment can only be
  altered using the routines supplied by "alter relationships".

        The join can be  completed  by  pressing  the  `Leave  Editor'
  button.  The  percentage  mismatch  is  displayed,  and  the user is
  required to confirm that they want to perform the join.
 @24. TX 1 @               Copy the database

        Used to make a copy of the database. If required the  database
  size  can  be altered using this option. The "version" of a database
  is  encoded as the last letter in the names of the five  files  that
  contain the database.

        Supply a "version" number (the default is version 1),  and  if
  required  select a new size for the database. The size of a database
  is the number of lines of information it can hold. It needs  a  line
  for each gel reading and another for each contig.
 @19. TX 1 @               Check database

        Used to perform a check on  the  logical  consistency  of  the
  database.  No  user  intervention  is  required.  If  selected "with
  dialogue" the program also checks for any sections of the  consensus
  that contain 15 dashes in 20 characters.

        The following relationships are checked:
  1.       If gel reading A thinks gel reading B is its left neighbour
  does B think A  is its right neighbour?  The error message is
  "Hand holding problem for gel reading A"
  followed by  the gel descriptor lines for gel readings A and B.
  2.       Are there any contig lines with no left or  right  end  gel
  readings?  The error message is
  "Bad contig line number A"
  3.       Do the gel readings that are  described  as  left  ends  on
  contig lines agree that they are left ends?  The error message is
  "The end gel readings of contig A have outward neighbours"
  4.       Are there gel readings that are in more  than  one  contig?
  The error message is
  " Gel number A is used N times"
  5.       Are there gel readings that are not  in  any  contig?   The
  error message is
  " Gel number A is not used"
  6.       Do the relative positions of   gel  readings   agree   with
  their  position  as  defined by left and right neighbourliness?  The
  error message is
  " Gel number A with position X is left neighbour of  gel  number   B
  with position Y"
  7.       Are there any loops in   contigs?    If   so   no   further
  checking is done.  The error message is
  " Loop in contig n no further checking done, but gel reading numbers
  follow"
  The program  then  prints the gel  reading  numbers  in  the  looped
  contig up to the start of the loop.
  8. Are there any contigs of length <1? The error message is
  " The contig on line number x has zero length"
  9. Are there any gel readings (used in only one  contig)  that  have
  zero length? The error message is
  " Gel number N has zero length"
  Note that "auto assemble"  also uses this logical consistency  check
  and will only tolerate a "Gel number N is not used" error. Any other
  error will cause it to give up.
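
        Checks 1 and 6 above can be sketched as follows, using the
  gel descriptor lines from the "Show relationships" example. The
  record layout and function name are assumptions for illustration,
  not the program's data structures.

```python
from typing import NamedTuple


class Gel(NamedTuple):
    name: str
    number: int
    position: int
    length: int
    left: int    # 0 means no left neighbour
    right: int   # 0 means no right neighbour


def check_neighbours(gels):
    """Return error messages for hand holding and position problems."""
    by_number = {g.number: g for g in gels}
    errors = []
    for g in gels:
        if g.left != 0:
            left = by_number.get(g.left)
            if left is None or left.right != g.number:
                errors.append(
                    "Hand holding problem for gel reading %d" % g.number)
            elif left.position > g.position:
                errors.append(
                    "Gel number %d with position %d is left neighbour of "
                    "gel number %d with position %d"
                    % (left.number, left.position, g.number, g.position))
        if g.right != 0:
            right = by_number.get(g.right)
            if right is None or right.left != g.number:
                errors.append(
                    "Hand holding problem for gel reading %d" % g.number)
    return errors


EXAMPLE = [
    Gel("HINW.010", 6, 1, -279, 0, 3),
    Gel("HINW.007", 3, 91, -265, 6, 5),
    Gel("HINW.009", 5, 137, -299, 3, 17),
    Gel("HINW.999", 17, 140, 273, 5, 12),
    Gel("HINW.017", 12, 193, 265, 17, 18),
    Gel("HINW.031", 18, 385, -245, 12, 2),
    Gel("HINW.004", 2, 401, -289, 18, 0),
]
print(check_neighbours(EXAMPLE))
```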
 @29. TX 1 @               Examine quality

        Analyses the quality of the data in a contig.  It  reports  on
  the  proportion  of the consensus that is "well determined" and will
  display a sequence of symbols  that  indicate  the  quality  of  the
  consensus at each position.

        Identify the contig to analyse, and the section  of  interest.
  The  current  consensus  calculation  cutoff  score  will be used to
  decide if each position is "well determined". In general the quality
  of  a  reading deteriorates along the length of the gel and so it is
  also possible to use a length cutoff for  the  quality  calculation.
  Only  the  data  from  the  first  section  of  each reading will be
  included in the quality calculation. The  length  is  altered  under
  "set parameters" and is initially set to the maximum reading length.
  A summary showing the percentage of the consensus  that  falls  into
  each category of quality is shown. Choose whether or not to have the
  quality codes for each position of the consensus displayed. They can
  be displayed as either graphics or text.

        The quality of the data depends on the number of times it  has
  been  sequenced  and the particular uncertainty codes  used  in each
  gel reading.  This function divides the data into  five  categories,
  assigning each a symbol or code:
  1.  Well determined on both strands and they agree.  code=0
  2.  Well determined on the plus strand only.  code=1
  3.  Well determined on the minus strand only.  code=2
  4.  Not well determined on either strand.  code=3
  5.  Well determined on both strands but they disagree.  code=4
  A position is "well determined" if it is assigned one of the symbols
  A,C,G,T by the algorithm described in the section "calculate a
  consensus".   The  calculation  is  performed  separately  for  each
  strand.
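
        The five categories can be derived from the two per-strand
  consensus symbols as sketched below. The function names are
  hypothetical; the rule that only A,C,G,T count as well determined is
  taken from the text above.

```python
WELL_DETERMINED = set("ACGT")


def quality_code(plus, minus):
    """Return the quality code 0-4 for one consensus position.

    plus, minus -- consensus symbols calculated for each strand alone
    """
    plus_ok = plus in WELL_DETERMINED
    minus_ok = minus in WELL_DETERMINED
    if plus_ok and minus_ok:
        return 0 if plus == minus else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def summarise(codes):
    """Return the percentage of positions falling into each code."""
    total = len(codes)
    return {code: 100.0 * codes.count(code) / total for code in range(5)}


codes = [quality_code(p, m) for p, m in zip("ACG-T", "AC-GA")]
print(codes)
print(summarise(codes))
```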

        If the user chooses to have the data displayed graphically the
  following  scheme  is used. A rectangular box is drawn so that the x
  coordinate  represents  the  length  of  the  contig.  The  box   is
  notionally divided vertically into 5 possible levels which are given
  the y values: -2,-1,0,1,2.  The quality  codes  attributed  to  each
  base  position are plotted as rectangles.  Each rectangle represents
  a region in which the quality codes are identical, so a single  base
  having a different code from its immediate neighbours will appear as
  a very narrow rectangle.

    Rectangle bottom and top y values

       Quality 0 rectangle from 0 to 0
       Quality 1 rectangle from 0 to 1
       Quality 2 rectangle from 0 to -1
       Quality 3 rectangle from -1 to 1
       Quality 4 rectangle from -2 to 2

        Obviously a single line  at  the  midheight  shows  a  perfect
  sequence.
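
        The grouping of identical neighbouring codes into rectangles
  can be sketched as below. The y values are copied from the table
  above; the function name and the tuple layout are assumptions.

```python
from itertools import groupby

# Quality code -> (bottom, top) y values, as listed in the table.
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes, first_position=1):
    """Return (x_start, x_end, y_bottom, y_top) for each run of codes."""
    rectangles = []
    position = first_position
    for code, run in groupby(codes):
        run_length = len(list(run))
        bottom, top = Y_RANGE[code]
        rectangles.append((position, position + run_length - 1, bottom, top))
        position += run_length
    return rectangles


# A single code 3 among code 1 positions gives a one-base-wide rectangle.
print(quality_rectangles([1, 1, 3, 1, 0, 0]))
```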

        Typical dialogue is shown below.

     41.47% OK on both strands and they agree(0)
     55.48% OK on plus strand only(1)
      2.08% OK on minus strand only(2)
      0.97% Bad on both strands(3)
      0.00% OK on both strands but they disagree(4)
    ? (y/n) (y) Show sequence of codes

             10         20         30         40         50
     1111111111 1111111111 1111111111 1111111111 1111111111

             60         70         80         90        100
     1111111111 1111111111 1111111111 3111111111 1111111111

            110        120        130        140        150
     1111111111 1111131111 1111111111 1111111111 1111111111

            160        170        180        190        200
     1111111111 1111111111 1111111111 1111111111 1111111133

            210        220        230        240        250
     1311111111 1111111111 1111111110 0000000000 0000220000

            260        270        280        290        300
     0000000000 0020000000 2200000202 0002000000 0000222200

 @26. TX 3 @              Alter relationships

        Used  to  make  what  are  normally  illegal  changes  to  the
  database. That is the normal checks are not done and any item in the
  database can be changed independently of all others. Users  need  to
  know  what they are doing because it is very easy to make a horrible
  mess. Always start by making a copy!

        By using the options here users can  move  one  section  of  a
  contig  relative  to  another, break contigs, remove contigs, remove
  gel readings, etc. To give flexibility most of the commands do  only
  one  thing. This means that several commands may have to be executed
  to complete any change.

        The following options are offered:

     Cancel
     Line change
     Check logical consistency
     Remove contig
     Shift
     Move gel reading
     Rename gel reading
     Break a contig
     Remove a gel reading
     Alter raw data parameters

  1. QUIT returns to the main options of BAP.
  3. Line change
  allows the user to change the contents  of  any line in the file  of
  relationships.   The  line is selected by number, the program prints
  the current line and prompts for the new  line.
  4.  Check logical consistency
  5. Remove a contig
  This function removes a contig and all its gel  readings.  The  user
  specifies any reading in the contig.
  6.  Shift
  allows the user to change all the relative  positions of  a set   of
  neighbouring  gel  readings by some fixed value, i.e.  it will shift
  related gel readings either left or  right.   It  can  therefore  be
  used   to  change the alignment of the gel readings in a contig.  It
  prompts for the number of the first gel reading to shift  and   then
  for  the   distance  to  move  them (Note a negative value will move
  the gel readings left and a positive value right).   It  then chains
  rightwards  (ie  follows  right  neighbours)  and  shifts  each  gel
  reading,  in  turn,  up to the  end of the contig.  (This means that
  only those gel readings from the first to shift to the rightmost are
  moved). It updates the length of the contig accordingly.
  7. Move gel reading
  is  a  function  to  renumber  a  gel  reading.  It  moves  all  the
  information  about  a  gel reading on to another line. The user must
  specify the number of the gel  reading to move and the number of the
  line  to place it. It takes care of all the relationships. Of course
  gel readings must not be  moved  to  lines  occupied  by  other  gel
  readings!
  8.  Rename gel reading
  is a function that is used to  rename  the archive   names   of  gel
  readings   in  the  database;   it only changes the name in the .ARN
  file of the  database.

  9. Break contig

        Occasionally it is necessary to break a contig into two  parts
  and  this  can be achieved using this option. The program needs only
  the number of a gel reading. This is  the  gel   reading  that  will
  become  a  left  end  after  the  break.  That is, the break is made
  between this gel reading and its left neighbour. A new  contig  line
  is created so ensure that there is sufficient space in the database.
  10. Removing gel readings from contigs

        Gel  readings  can  be  removed  from  contigs.  If  they  are
  essential  for  holding  the  contig  together  (ie are the only gel
  reading covering a particular region), the program will create a new
  contig.

  11. Alter raw data parameters

        Allows the user to edit the individual  raw  data  parameters,
  such  as  the  left  and  right  cutoff  lengths and the name of the
  machine readable trace file.  The user must specify the gel line  to
  modify,  and  provide  new values for the length of the raw sequence
  including cutoff lengths, the left cutoff position,  the  length  of
  the original working sequence, the machine type, and the name of the
  raw data file, where these values change.
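
        The "Shift" option (6) described above can be sketched as
  follows, using the gel lines from the "Show relationships" example.
  The dictionary layout and function names are assumptions; the contig
  length is taken here as the rightmost base covered by any reading.

```python
def shift_readings(gels, first, distance):
    """Shift `first` and every reading reached through right neighbours.

    gels -- dict: gel number -> {'position', 'length', 'right'}
    Returns the list of gel numbers that were moved.
    """
    moved = []
    number = first
    while number != 0:
        if number in moved:
            raise ValueError("loop in contig")
        gels[number]["position"] += distance
        moved.append(number)
        number = gels[number]["right"]
    return moved


def contig_length(gels, members):
    """Rightmost base covered by the given readings (assumed rule)."""
    return max(gels[n]["position"] + abs(gels[n]["length"]) - 1
               for n in members)


def example_gels():
    """Gel lines from the Show relationships example."""
    rows = [(6, 1, -279, 3), (3, 91, -265, 5), (5, 137, -299, 17),
            (17, 140, 273, 12), (12, 193, 265, 18), (18, 385, -245, 2),
            (2, 401, -289, 0)]
    return {n: {"position": p, "length": l, "right": r}
            for n, p, l, r in rows}


gels = example_gels()
print(contig_length(gels, gels))      # length before the shift
print(shift_readings(gels, 18, 5))    # readings 18 and 2 move right by 5
print(contig_length(gels, gels))      # length after the shift
```

  Readings to the left of the first one shifted keep their positions,
  which is why the alignment within the contig changes.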
 @27. TX 1 @  Set display parameters

        Used to  redefine  the  parameters  that  control  the  cutoff
  employed  by  the  consensus  calculation  and quality examiner, the
  maximum  length  of  each  reading  to  include   in   the   quality
  calculation,  the line length used by the display function, the text
  window length used by the graphics options, and the graphics  window
  length used by the graphics options.

        The default cutoff score is 75%. The default line length is 50
  characters. For protein sequences the cutoff is always 100%.

        The text window used by  the  graphics  options  controls  the
  amount  of  sequence  listed at the crosshair position. The graphics
  window controls the "zoom" function. Both these windows are  defined
  as  the number of bases that should be shown, to both left and right
  of the crosshair.
 @30. TX 3 @  Shuffle pads

        One weakness of the alignment strategy used  is  that  padding
  characters  are  not  always  aligned  by the assembly routine. This
  function attempts to align padding characters using  a  very  simple
  strategy.  It  does  not  solve  all pad alignment problems but is a
  useful first step during cleaning-up operations.
 @10. TX 2 @Clear graphics

        Clears graphics from the screen.
 @11. TX 2 @Clear text

        Clears  text from the screen.
 @12. TX 2 @Draw a ruler.

        This option allows the user to draw a ruler or scale along the
  x  axis  of the screen to help identify the coordinates of points of
  interest. The user can define the position of the first base  to  be
  marked  (for  example if the active region is 1501 to 8000, the user
  might wish to mark every 1000th base starting at either 1501 or 2000
  -  it  depends  if  the user wishes to treat the active region as an
  independent unit with its own numbering starting at its  left  edge,
  or  as  part  of  the  whole sequence). The user can also define the
  separation of the ticks on the scale and their height.  If  required
  the labelling routine can be used to add numbers to the ticks.
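
        The two numbering choices in the example above can be seen in
  the following sketch. The function and argument names are
  hypothetical.

```python
def ruler_ticks(first_tick, separation, region_end):
    """Return the base positions at which ticks would be drawn."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return list(range(first_tick, region_end + 1, separation))


# Active region 1501 to 8000, a tick every 1000 bases.
print(ruler_ticks(1501, 1000, 8000))  # region numbered from its left edge
print(ruler_ticks(2000, 1000, 8000))  # region as part of the whole sequence
```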
 @14. TX 2 @Reposition plots

        The position of each of the plots is defined relative to the
  user's drawing board, which has size 1-10,000 in x and 1-10,000 in y.
  Plots for each option are drawn in a window defined by x0,y0 and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the window, xlength is the width of the window and
  ylength is the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "ANALMARG" when
  the program is started. Users can have their own file  if  required.
  As  all  the plots start at the same position in x and have the same
  width, x0 and xlength are the same for all options. Generally  users
  will  only  want  to change the start level of the window y0 and its
  height ylength. This option allows users to change window  positions
  whilst  running  the  program.   The  routine  prompts first for the
  number of the option that the user wishes to reposition;  then  for
  the  y  start and height; then for the x start and length. Note that
  changes to the x values affect all options. If the user  types  only
  carriage  return  for any value it will remain unchanged. Note that,
  unlike all the other programs, the boxes used to contain  analytical
  results (eg plot quality) should not be made to overlap one another,
  as the function of the crosshair routine depends on  which  box  the
  crosshair is in!
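
        A new set of window values could be checked before use as
  sketched below. The class and function names are assumptions, as is
  the exact treatment of the window edges.

```python
from typing import NamedTuple

BOARD_MIN, BOARD_MAX = 1, 10000


class Window(NamedTuple):
    x0: int
    y0: int
    xlength: int
    ylength: int


def on_board(w):
    """True if the window lies wholly on the 1-10,000 drawing board."""
    return (w.x0 >= BOARD_MIN and w.y0 >= BOARD_MIN
            and w.xlength > 0 and w.ylength > 0
            and w.x0 + w.xlength <= BOARD_MAX
            and w.y0 + w.ylength <= BOARD_MAX)


def overlap(a, b):
    """True if two windows share any area (to be avoided in this program)."""
    return (a.x0 < b.x0 + b.xlength and b.x0 < a.x0 + a.xlength
            and a.y0 < b.y0 + b.ylength and b.y0 < a.y0 + a.ylength)


contigs = Window(1000, 1000, 8000, 3000)
quality = Window(1000, 4500, 8000, 2000)
print(on_board(contigs), on_board(quality), overlap(contigs, quality))
```

  As x0 and xlength are the same for all options, only the y values
  decide whether two boxes overlap.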
 @15. TX 2 @Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written on the  diagram
  with  its  left end at the cross-hair position.  The cross-hair will
  then immediately reappear. The user may put the same label on
  another part of the diagram in the same way, or hit the space bar
  to be asked whether another label is to be typed in.

        Typical dialogue follows.
  ? Menu or option number=15
  Type label then drive cross hair to left or right end
  of label position then hit  "L"  to  write label left
  justified or  "R"  to  write label right justified or
  the space bar to quit


  ? Label=delta gene

   missing graphics

  ? Label=

 @16. TX 2 @Display a map

        This is disabled!
 @7. TX 1 @Redirect output

        Used to direct output that would normally appear on the screen
  to a file and to create postscript output.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @13. TX 2 @Use crosshair

        This option puts a steerable cross on  the  screen  which  the
  user  drives  around  by  using  the arrow keys (or mouse). When the
  crosshair is visible a number of options are available if  the  user
  types  one  of  a  set  of  special  keyboard  characters. Any other
  characters will cause an exit from the crosshair option. The special
  keys are:

      I = Identify the nearest gel reading
      Z = Zoom in
      Q = plot Quality
      S = display the aligned Sequences at the crosshair position
      N = list the Names and Numbers of the sequences at the crosshair

        In order for  any  of  these  special  keys  to  operate,  the
  crosshair  must  be  in  an appropriate display box, and the precise
  function of the keys will also depend on which box the crosshair  is
  in.

        If the crosshair is in the "plot  all  contigs"  box,  Z  will
  cause  a  new box to appear showing all the readings for the nearest
  contig; Q will give the same as Z but will also produce an extra box
  showing the "quality" plot.

        If Z is hit in the "plot single contig" box, the  contig  will
  be  zoomed  to  the  current  graphics window size. The zoom will be
  roughly centred on the crosshair position. Because  of  this  it  is
  possible  to  step  along  a  contig  by repeatedly zooming with the
  crosshair near to one end of the single contig display box. If I  is
  hit  the crosshair must be close to a gel reading line. If Q is hit,
  the quality plot will be produced for the region shown in  the  plot
  single  contig  box. In all cases when the "plot all contigs" box is
  shown, a vertical line will  bisect  the  line  that  represents  the
  relevant contig, at the current position.

        If the crosshair is in the plot quality box only the character
  "s" will operate as a special symbol.

        The number of bases shown in the N and S options is controlled
  by  the  current graphics text window size, and the size of the zoom
  window by the current graphics window size.  Both  are  set  by  the
  parameter setting function of the general menu.
 @33. TX 2 @Plot single contig

        This option produces a schematic of a  selected  region  of  a
  single  contig by drawing a horizontal line to represent each of its
  gel readings. The lines show the relative positions of each  reading
  and  also  their  sense.  The  plot  is  divided vertically into two
  sections by a line that is identified by an asterisk drawn  at  each
  end.  All lines that lie above this line represent readings that are
  in their original sense, all lines below show readings that  are  in
  the  complementary  sense to their original. By use of the crosshair
  function the plot can  be  stepped  through  and  examined  in  more
  detail. See help on crosshair.
 @34. TX 2 @Plot all contigs

        This option produces a schematic  of  all  the  contigs  in  a
  database.  It  does  this  by drawing a horizontal line to represent
  each of them. In order to show the ends of each contig it draws  the
  lines for contigs at alternate heights: the first at height one, the
  second at height two, the third at height one, etc. The order of the
  contigs  in  the display is the same as their order in the database.
  By use of the crosshair function the plot can be stepped through and
  examined in more detail. See help on crosshair.
 @31. TX 3 @ Disassemble readings

        This function is used to remove a  list  of  readings  from  a
  database, or to create a new contig from a single reading moved from
  an existing contig.  This latter mode is useful for repositioning  a
  reading  in  a  repeat:  once separated it can be placed in the join
  editor and scrolled  by  the  other  copies.   Removal  of  sets  of
  readings  works  in  two  modes:  1. A set of adjacent readings in a
  contig can be removed by the user naming the two end ones; or  2.  A
  batch  of  readings from any number of contigs can be defined by the
  user naming a file containing a list of reading names.  The  program
  cleans  up  the database by moving data to fill up any holes made in
  the files.

        For both modes of operation the program will ask for a file of
  file names.  If users create their own file (ie mode 2) each reading
  NAME must be on a separate line. For mode 1 the user types the NAMES
  of  the  leftmost and rightmost readings to be removed. They and all
  intervening readings will be removed. Note that the routine operates
  on  reading  names  - not numbers. For both modes, if necessary, new
  contigs will be created.
 @35. TX 1 3 @Find internal joins

        The purpose of this function is to use  data  already  in  the
  database  to  find  possible  joins between contigs.  Joins may have
  been missed due to poor data or  may  have  not  been  made  due  to
  repeated  sequences.  Where  appropriate, it may be possible to find
  potential joins by using the "unused data" derived  from  sequencing
  machines.
  For all overlaps found when the X version is used, the contig editor
  (in join mode) will be called up with the two contigs aligned.
  The database is checked for logical consistency.  Supply  a  minimum
  initial  match  length,  a minimum alignment block, the maximum pads
  per sequence, the maximum  percent  mismatch  after  alignment,  the
  probe length. Choose if clipped data is to be used, if so define the
  window size for finding good data and the number of  dashes  allowed
  in  the  window. Processing will commence.  Most of these values are
  used in an identical way in the autoassemble  function.  The  others
  are defined below.
  The program strategy
  Take the first contig and calculate its consensus. If  clipped  data
  is  being  used  examine  all readings that are in the complementary
  orientation, and sufficiently near to the contig's left end, to  see
  if  they have good clipped sequence which if present, would protrude
  from the left end of the contig.  If  found  add  the  longest  such
  sequence to the left end of the consensus. Do the same for the right
  end by examining readings that are in their original orientation. If
  any  are  found  add  the  longest extension to the right end of the
  consensus. Repeat the consensus calculations and extensions for  all
  contigs  hence  producing  an extended consensus. If clipped data is
  not  being  used  simply  calculate  the  consensus  for  the  whole
  database.  Now  look  for  possible joins by processing the extended
  consensus in the following  way.  Take  the  last,  say  100,  bases
  (termed  the  "probe  length"  by  the  program)  of  the  rightmost
  consensus, compare it in both orientations with the extended consensus
  of  all the other contigs. Display any sufficiently good alignments.
  Repeat with the left end of the rightmost contig. Do  the  same  for
  the ends of all the extended contigs, always only comparing with the
  contigs to their left, so that the same matches do not appear twice.
  Good clipped data is defined by sliding a window of "Window size for
  good  data scan" bases outwards along the sequence and stopping when
  "Maximum number of dashes in scan window" or more dashes  appear  in
  the  window.   Note that it is advisable to have some sort of cutoff
  because if we simply take all the  data  it  might  be  so  full  of
  rubbish  that  we won't find any good matches. For the same reason it
  is worth trying the procedure with different cutoffs. An initial run
  using  no  clipped  data  is  also  recommended.   Sufficiently good
  alignments are defined by  criteria  equivalent  to  those  used  in
  autoassemble,  however here we only display alignments that pass all
  tests.
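
        The window scan that defines good clipped data can be sketched
  as below. The function name is hypothetical, and the choice to keep
  the bases up to the start of the first failing window is an
  assumption; the program may stop at a slightly different point.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Return how many bases of clipped data are treated as good.

    clipped     -- clipped sequence, ordered outwards from the contig end
    window_size -- "Window size for good data scan"
    max_dashes  -- "Maximum number of dashes in scan window"
    """
    last_start = max(len(clipped) - window_size, 0)
    for start in range(last_start + 1):
        window = clipped[start:start + window_size]
        if window.count("-") >= max_dashes:
            return start          # stop at the first poor window
    return len(clipped)


clipped = "ACGTACGTAC" + "A-C--G-T--"
print(good_clipped_length(clipped, 5, 3))
```

  Trying several window sizes and dash limits, as recommended above,
  only requires calling the function with different arguments.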
  Bugs
  If a small contig is wholly contained within a larger one, such that
  its  ends  are further than ("Probe length" - "Minimum initial match
  length") from the ends of the larger contig, and the  consensus  for
  the small contig lies to the left of the consensus for the large
  contig, the overlap will not be discovered. (See the search strategy).
  All numbering is relative to base number one in the contig:  matches
  to  the  left  (i.e.  in  the clipped data) have negative positions,
  matches off the right end of the contig (i.e. in the  clipped  data)
  have  positions  greater  than  that  of  the  contig  length.   The
  convention for reporting the positions of overlaps is as follows: if
  neither  contig needs to be complemented the positions are as shown.
  If the program says "contig x in the -  sense"  then  the  positions
  shown  assume  contig  x  has  been complemented. For example in the
  results given below the positions  for  the  first  overlap  are  as
  reported,  but  those  for  the second assume that the contig in the
  minus sense (i.e. 443) has been complemented.


   Possible join between contig   445 in the + sense and contig   405
   Percentage mismatch after alignment =  4.9
          412        422        432        442        452        462
       405  TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
             ********* * ********  ***** *** ********** ********** **********
       445  -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
         -127       -117       -107        -97        -87        -77
          472        482        492        502        512
       405  TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
            ********** ********** ********** ********** **
       445  TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
          -67        -57        -47        -37        -27
   Possible join between contig   443 in the - sense and contig   423
   Percentage mismatch after alignment = 10.4
           64         74         84         94        104        114
       423  ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
            **** ***** ********** ********** ******  ** ***** **** *********
       443  ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
         3610       3620       3630       3640       3650       3660
          124        134        144        154        164
       423  TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
            *** ****** ********** ** *******  *** ***** ** **
       443  TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
         3670       3680       3690       3700       3710


 @36. TX 3 @Double strand

        PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
  IT HAS CURRENTLY HAD VERY LITTLE TESTING.

        Uses the cutoff data to change single stranded sections  of  a
  contig  into double stranded sections. Data is used carefully to try
  and minimise the number of data disagreements  created.  However  it
  must  be  noted  that  an overall slight degradation in quality will
  still occur.

        When using this option you will be prompted for a contig and a
  region  within that contig. The default region is the entire contig.
  The option will then search through the region  for  areas  of  good
  data on one strand and cutoff data on the opposite strand, extending
  the cutoff data. The amount of cutoff data to use is decided from a
  maximum number of mismatches and a score. The score is accumulated
  over the length of an alignment by adding points for matches (+1),
  mismatches (-8) and insertions (-5).
  The defaults are:

  maximum mismatches      :  6

  score for mismatch      : -8
  score for correct match : +1
  score for insertion     : -5
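
        The way such a score accumulates can be sketched as below. The
  sketch only adds up the points listed above for two already aligned
  strings; how the program turns the score into a decision about how
  much cutoff data to use is not described here and is not modelled.
  The function name and the use of "-" as the gap symbol are
  assumptions made for the illustration.

```python
MATCH_SCORE = 1        # score for correct match
MISMATCH_SCORE = -8    # score for mismatch
INSERTION_SCORE = -5   # score for insertion
MAX_MISMATCHES = 6     # default maximum mismatches


def score_alignment(good, cutoff, gap="-"):
    """Return (score, mismatches) for two aligned strings of equal length.

    Assumption: a gap symbol in either string marks an insertion.
    """
    if len(good) != len(cutoff):
        raise ValueError("aligned strings must have the same length")
    score = 0
    mismatches = 0
    for a, b in zip(good, cutoff):
        if a == gap or b == gap:
            score += INSERTION_SCORE
        elif a == b:
            score += MATCH_SCORE
        else:
            score += MISMATCH_SCORE
            mismatches += 1
    return score, mismatches


def within_mismatch_limit(mismatches, limit=MAX_MISMATCHES):
    """True if the mismatch count does not exceed the limit."""
    return mismatches <= limit
```

        With these values a single mismatch cancels eight matches, so
  only cutoff data that agrees closely with the good strand can build
  up a positive score.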

        Note that with successive calls to this option it is  possible
  to double strand more and more data. Naturally, however, the quality
  of the data generated will diminish each time.
 @37. TX 3 @Auto-select oligos

        PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
  IT HAS CURRENTLY HAD VERY LITTLE TESTING.

        Generates a file (default "primers") of suggested  primers  to
  use  for  covering  a single stranded section or for walking off the
  end of a contig. The file generated contains the gel  reading  name,
  the  primer sequence, its offset in the contig and the  orientation.
  An example file would be :

  c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
  c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +
  c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +
  c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +
  c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +
  c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +
  c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +
  c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +
  c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +
  c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +
  c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -
  c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -
  c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -
  c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -
  c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -
  c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -
  c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -
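
        A file in this layout can be read back with a few lines of
  code. The sketch below is not part of the program; it assumes every
  line has exactly the form shown in the example (name, sequence,
  offset in "(@ ... )" and a + or - sense), and the names Primer,
  parse_primer_line and parse_primers are hypothetical.

```python
import re
from typing import NamedTuple


class Primer(NamedTuple):
    reading: str    # gel reading name
    sequence: str   # primer sequence
    offset: int     # offset in the contig
    sense: str      # "+" or "-"


# Pattern inferred from the example file only.
_LINE = re.compile(
    r"^\s*(\S+)\s+([A-Za-z]+)\s+\(@\s*(-?\d+)\s*\)\s*([+-])\s*$"
)


def parse_primer_line(line):
    """Parse one line such as 'c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +'."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised primer line: {line!r}")
    name, seq, offset, sense = m.groups()
    return Primer(name, seq, int(offset), sense)


def parse_primers(text):
    """Parse the whole contents of a primers file, skipping blank lines."""
    return [parse_primer_line(l) for l in text.splitlines() if l.strip()]
```

        Calling parse_primers on the contents of the file gives one
  Primer record per suggested oligo, which can then be sorted or
  filtered by offset or sense.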

        This is best employed after having previously used the 'Double
  strand' option.  When selecting the option you will be asked for the
  contig, a region within this contig and the file to write  the  list
  of  primers  to.  For  each  primer suggested a tag is automatically
  created containing details of the gel reading name and the sequence.
  Preferably the tag will be created on the gel reading from which the
  primer was selected. However this is not always possible so  failing
  that  the  tag  will  be  on another sequence overlapping the primer
  position.

        When invoked with the dialogue option you will be asked two
  more questions about the position and size of the stretch of
  consensus searched for suitable oligos. You will be prompted for the
  start and end of a region (default 40-140), given as distances to the
  left of each single stranded section in the chosen region.

        For example:

  ? Menu or option number=d37
   Auto-select oligos
   Default Contig identfier=/e97f2.s1
   ? Contig identfier=
   ? Start position in contig (1-20942) (1) =10000
   ? End position in contig (10000-20942) (20942) =11000
   Default Name of file for primers=primers
   ? Name of file for primers=
   ? Start of oligo choice region (1-1024) (40) =50
   ? End of oligo choice region (50-1024) (150) =150


        This implies that we are going to look for oligos  to  use  as
  primers covering the region 10000 to 11000. For each single stranded
  section in this region we search for oligos between 50 and 150 bases
  to the left of its start. So if we had a single stranded section from
  10121 to 10295 we would search for oligos in the region 9971 to 10071.
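
        The arithmetic in this example can be written out as a small
  sketch. It is an illustration only: the function name is
  hypothetical, and it covers just the case described above, where
  the oligo choice region lies to the left of the start of the single
  stranded section.

```python
def oligo_search_region(section_start, region_start=40, region_end=140):
    """Return (first, last) consensus positions searched for oligos.

    section_start : start of the single stranded section
    region_start  : start of the oligo choice region (distance to the left)
    region_end    : end of the oligo choice region (distance to the left)
    """
    if region_start > region_end:
        raise ValueError("region_start must not exceed region_end")
    return section_start - region_end, section_start - region_start


# The example from the text: section 10121 to 10295, choice region 50 to 150.
first, last = oligo_search_region(10121, 50, 150)
print(first, last)
```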
 @38. TX 1 @Check assembly

        This new function is used  for  checking  the  positioning  of
  assembled  readings.   It  is  useful  for  checking  sequences that
  contain repeats of length similar to that of a single  gel  reading.
  It  takes  the poor quality data for each reading and compares it to
  the segment of the consensus  to  which  it  should  align.  If  the
  extension of the read does not match the consensus then the read (or
  its neighbours) has probably been assembled into  the  wrong  place.
  The  program  displays  the  bad  alignments.   The  quality  of  an
  alignment is defined by the percentage mismatch.  Naturally the user
  should  select  a  value that takes into account the poor quality of
  the data being aligned.

        When the routine is used  from  the  X  version  the  user  is
  offered  the  editor  to  examine poor alignments. If alignments are
  reported as poor, but on inspection are OK, the user can set  a  tag
  so  that  the  poor  quality  data  is  ignored on subsequent passes
  through the routine. Note, however, that such data will then also be
  ignored by the automatic double stranding routine!

        The user defines the percentage mismatch, and the window size
  and number of dashes allowed in the window that is used to select
  how much of the poor data is employed. The user can also choose to
  save the names of the poorly aligned reads in a file, and can select
  an individual contig or scan the whole database. The file of names
  can be used by the disassembly routine to remove the poorly aligned
  reads from the database, and can then be used to reassemble them.
  Note that the routine complements each contig twice during
  processing.
 @39. TX 1 @Find read pairs

        This new function is used to check the positions  of  readings
  taken  from  each end of the same template. For each forward read it
  searches for a corresponding reverse reading. The search can be over
  the  whole  database  or  over  a single contig.  The results can be
  presented graphically for single contig searches and  the  crosshair
  function can be used to identify the readings displayed.

        Note that at present the function only knows  that  two  reads
  are from the same template by comparing reading names. For our local
  projects we use the following naming convention: forward  reads  are
  named abcdefgh.s1 and reverse reads abcdefgh.r1. The program expects
  this naming convention and so if it finds read fred.s1  and  fred.r1
  it assumes they are the forward and reverse reads for template fred.
  In the future we will make the routine more general!
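
        The naming convention can be illustrated with a short sketch
  that groups reading names by template. It is not the program's own
  code: it assumes only the ".s1" and ".r1" suffixes described above,
  ignores any other suffix, and the function name is hypothetical.

```python
from collections import defaultdict


def find_read_pairs(names):
    """Group reading names into {template: (forward, reverse)}.

    Assumption: forward reads end in ".s1" and reverse reads in ".r1";
    the part before the final "." is taken as the template name.
    Templates with only one of the two reads are left out.
    """
    templates = defaultdict(dict)
    for name in names:
        template, dot, suffix = name.rpartition(".")
        if not dot:
            continue  # no suffix at all, cannot be classified
        if suffix == "s1":
            templates[template]["forward"] = name
        elif suffix == "r1":
            templates[template]["reverse"] = name
    return {
        t: (reads["forward"], reads["reverse"])
        for t, reads in templates.items()
        if len(reads) == 2
    }
```

        Given the names fred.s1, fred.r1 and jim.s1 this returns a
  single pair for template fred; jim.s1 is dropped because no reverse
  read with the matching name is present.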

        If a single contig is selected and the output  is  listed  the
  program  displays  two lines for each pair: the first line shows the
  reading name, its position and length, and the distance between  the
  extremities  of  the two reads; the second line shows the other read
  name, its position and length.  If  there  are  pairs  that  are  in
  separate contigs or are facing away from one another they are listed
  after the pairs that face inwards.

        If the results are plotted the full length of the template  is
  drawn  with  arrows indicating the direction of reads and the extent
  of each reading. Those reads that  have  their  partner  in  another
  contig are marked by asterisks.

        Typical dialogue is shown below.

   ? Select contigs (y/n) (y) =
   Default Contig identifier=/i55d8.s1
   ? Contig identifier=
   ? Start position in contig (1-15227) (1) =
   ? End position in contig (1-15227) (15227) =
   ? Plot results (y/n) (y) = n
      852 k23a1.r1            249   238  1615
      806 k23a1.s1           1529  -335
      238 i68e6.s1            422   193  1632
      868 i68e6.r1           1756  -298
      576 k17a2.s1           2370   213  1676
      885 k17a2.r1           3790  -256
       84 k27g6.s1           3456   291  1777
      867 k27g6.r1           4905  -328
      453 k01g10.s1          5805   142  1251
      881 k01g10.r1          6909  -147
      781 i98b8.r1           6754   338  1079
       10 i98b8.s1           7653  -180
      883 k02d11.r1          7327   276  1597
      283 k02d11.s1          8726  -198
      269 i68f9.s1           8191   169  1055
      777 i68f9.r1           8891  -355
      710 i91c6.s1           8245    95  1516
      780 i91c6.r1           9403  -358
      596 k27d12.s1           136   329  -329
      219 k27d12.r1             1  -116
      159 k27d11.r1          1830  -263  -263
      317 k27d11.s1          2902   343
      886 k17g11.r1          7107  -123  -123
      647 k17g11.s1          1867   265
      851 i69g10.r1          8045  -137  -137
      277 i69g10.s1          4658   174

        If contigs are not selected the  pairs  are  sorted  on  their
  separations.

   ? Select contigs (y/n) (y) = n
   i68f2.s1            27  1781  1777
   i68f2.r1           776   111  1777
   k17f6.s1           601    60  1706
   k17f6.r1           856  1405  1706
   k17a2.s1           576  2370  1676
   k17a2.r1           885  3790  1676
   k27g3.s1           177 14985  1664
   k27g3.r1           889 13564  1664
   k27b12.s1          764     1  1086
   k27b12.r1          857   932  1086
   i98b8.s1            10  7653  1079
   i98b8.r1           781  6754  1079
   k16a3.s1           748  1276  1070
   k16a3.r1           784   472  1070
   k17b7.r1           786 14937 18942*
   k17b7.s1           787  3601 18942*
   k27d12.r1          219     1 15208*
   k27d12.s1          596   136 15208*
   k01g2.s1           502    87 14754*
   k01g2.r1           782  9224 14754*

 @ end of help