@-1. TX  0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. TX  0 @Modification

 @0.  TX -1 @SAP

        This is help information for the X  Windows  version  of  SAP.
  Currently  it  is  being brought up to date with the new features in
  XDAP.  The accuracy of this help should therefore not be assumed.

        This is an  interactive  program  whose  primary  use  is  for
  managing  shotgun  sequencing  projects, but it can also be used for
  handling alignments of other sequences, including those of proteins.
  Currently   the   maximum  'gel  reading'  length  is  set  to  4096
  characters. Almost all of the information below describes the use of
  the  program  for shotgun projects, but those using the programs for
  handling other sequence alignments should interpret it  accordingly.
  The data for such a project is stored in a special type of database.
  The program contains the tools that are  required  to  type  in  gel
  readings,  screen  them  against  vector  sequences  and restriction
  sites; enter new  gel  readings  into  the  database  (automatically
  comparing  and  aligning  them). In addition it contains editors and
  functions to examine the quality of the aligned sequences.

        There  are  three  main   menus:   "general",   "screen"   and
  "modification", and some functions have submenus.
    The general menu contains the following options:

         Open a database
         Display a contig
         List a text file
         Direct output
         Calculate a consensus
         Screen against restriction enzymes
         Screen against vector
         Check database
         Copy database
         Show relationships
         Set parameters
         Highlight disagreements
         Examine quality
         Find internal joins

  The screen menu contains:

         Clear graphics
         Clear text
         Draw ruler
         Use cross hair
         Change margins
         Label diagram
         Plot map
         Plot single contig
         Plot all contigs


  The modification menu contains:

         Edit contig
         Auto assemble
         Join contigs
         Complement a contig
         Alter relationships
         Extract gel readings


  The alter relationships menu contains:

         Cancel
         Line change
         Edit single gel reading
         Delete contig
         Shift
         Move gel reading
         Rename gel reading
         Break contig
         Alter raw data parameters



        Overview of the methodology

        The shotgun sequencing strategy

        In  the  shotgun  sequencing  procedure  the  sequence  to  be
  determined   is   randomly   broken  into  fragments  of  about  400
  nucleotides in length. These fragments are cloned and then  selected
  randomly  and  their  sequences    determined.     The  relationship
  between  any  pair  of fragments is  not  known  beforehand  but  is
  found  by  comparing  their   sequences.  If  the  sequence  of  one
  is found to be wholly or partially contained within that of  another
  for  sufficient  length  to  distinguish  an overlap  from  a repeat
  then those two fragments can be joined. The  process  of  selecting,
  sequencing and comparing is continued until the whole of the DNA  to
  be sequenced is in one continuous, well-determined piece.

        Definition of a contig

        A CONTIG is a set of gel  readings   that   are   related   to
  one  another   by   overlap  of  their  sequences.  All gel readings
  belong to a contig and each contig  contains  at   least   one   gel
  reading.   The  gel  readings in a contig can be summed to produce a
  continuous consensus sequence and the length of this sequence is the
  length  of the contig.  The rules used to perform this summation are
  given  under  "the  consensus  algorithm".   At  any  stage  of    a
  sequencing  project the data will comprise a number of contigs; when
  a  project  is complete  there  should be only one  contig  and  its
  consensus  will  be  the  finished  sequence.  Note that since being
  introduced and defined as above the word "contig" has been taken  up
  by  those involved in genomic mapping. In that context the consensus
  with a  precise length is not defined.

        Introduction to the computer method

        It is useful  to  consider  the  objectives  of  a  sequencing
  project  before  outlining  how  we use the computer to help achieve
  them. The aim of a shotgun  sequencing  project  is  to  produce  an
  accurate  consensus sequence from many overlapping gel readings.  It
  is necessary to know, particularly  at  the  latter  stages  of  the
  project,  how accurate the consensus sequence is. This enables us to
  know which regions of the sequence require further work and also  to
  know  when  the  project  is  finished.   To show the quality of the
  consensus, the programs described here produce  displays  like  that
  shown below.


                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

        This is an example showing the left  end  of  a  contig   from
  position   1  to  200.   Overlapping  this  region  are gel readings
  numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in  reverse  orientation
  to  their  original  reading  (denoted  by  a  minus sign). Each gel
  reading also has a name (e.g. HINW.010). It can be seen that  in  a
  number  of  places the sequences contain characters other than A,C,G
  and T. Some  of  these  extra  characters  have  been  used  by  the
  sequencer   to  indicate  regions  of  uncertainty  in  the  initial
  interpretation of the gel reading, but the asterisks (*)  have  been
  inserted  by  the  automatic assembly function in order to align the
  sequences.  Underneath  each  50  character  block  of  gel  reading
  sequences  is the consensus derived from the sequences aligned above
  (the line labelled CONSENSUS). For most of its length the  consensus
  has a definite nucleotide assignment but in a few positions there is
  insufficient agreement between the gel readings and so  a  dash  (-)
  appears  in  the  sequence.  This  display contains all the evidence
  needed to assess the quality of the consensus: the number  of  times
  the  sequence has been determined on each strand of the DNA, and the
  individual nucleotide assignments given for each gel reading.

        So the aim is to produce the consensus sequence  and,  equally
  important,  a  display of the experimental results from which it was
  derived.

        In order to achieve this the following operations need  to  be
  performed:
  1) Put individual  gel  readings  into  the  computer.   This  might
  involve   the   manual  interpretation  of  autoradiographs  or  the
  transfer and processing of machine-readable files  from  fluorescent
  sequencing machines.
  2) Check each gel reading to make sure it is not simply part of  one
  of the vectors used to clone the sequence.
  3) Check each gel reading to make sure  that  those  fragments  that
  span  the  ligation point used prior to sonication are not assembled
  as single sequences.
  4) Compare all the  remaining  gel  readings  with  one  another  to
  assemble them to produce the consensus sequence.
  5) Check the quality of the consensus and edit the sequences.
  6) When all the consensus is sufficiently well determined, produce a
  copy of it for processing by other analysis programs.

        It is very unlikely that this procedure will  only  be  passed
  through  once.   Usually steps 1 to 5 are cycled through repeatedly,
  with step 4 just adding new sequences to  those  already  assembled.
  Generally step 6 is also used in order to analyse imperfect sequence
  to check if it is the one the project intended to  sequence,  or  to
  look  for  interesting  features. Analysis of the consensus, such as
  searches for protein coding regions, can also help to find errors in
  the  sequence.  The  display  of  the overlapping gel readings shown
  above can be used  to  indicate,  not  only  the  poorly  determined
  regions,  but  also  which  clones  should be resequenced to resolve
  ambiguities, or those which can usefully be extended or sequenced in
  the reverse direction, to cover difficult regions.

        The original individual gel readings for a sequencing  project
  are  each  stored in separate files. As the gel readings are entered
  into the computer (usually in batches, say 10 from a film), the file
  names  they are given are stored in a further file, called a file of
  file names. Files of file names enable gel readings to be  processed
  in batches.

        For each sequencing project we start a project database.  This
  database  has  a  structure  specifically  designed for dealing with
  shotgun sequence data. In order to arrive  at  the  final  consensus
  sequence  many  operations  will  be performed on the sequence data.
  Individual fragments must be sequenced and compared in  both  senses
  (i.e.  both  orientations)  with  all  the  other sequences. When an
  overlap between a new gel reading and a contig is found,  they  must
  be aligned and the new gel reading added to the contig. If a new gel
  reading overlaps two contigs they must be aligned and joined. Before
  the  two contigs are joined one of them may need to be turned around
  (reversed  and  complemented)  so  that  they  are  both in the same
  orientation.
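
        As a small illustration of "turning around" a sequence, the
  sketch below reverses and complements a string. The function name is
  hypothetical, and leaving padding and uncertainty characters
  unchanged is an assumption made for this example; it is not a
  description of the program's own code.

```python
# Hypothetical helper; the program's own implementation is not shown
# in this help text.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse a sequence and complement A<->T and C<->G.

    Assumption: any other character (padding '*', uncertainty codes)
    is kept unchanged and simply moves to its mirrored position.
    """
    return seq.translate(_COMPLEMENT)[::-1]
```

  For example, reverse_complement("GGCACA*GTC") gives "GAC*TGTGCC".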

        Clearly, keeping track of all  these  manipulations  is  quite
  complicated,  and  to  be  able  to  perform  the operations quickly
  requires careful choice of data structure and algorithms. For  these
  reasons  it  is not practicable to store the gel readings aligned as
  shown in the display above. Rather, it is more convenient  to  store
  the  sequences unassembled, and to record sufficient information for
  programs to assemble  them  during  processing.  The  data  used  to
  assemble the sequences is called relational information.

        The database comprises five files and they are described under
  the section entitled "open database".

        Before entry into the project database each  new  gel  reading
  must  be  compared  to  look  for overlaps with all the data already
  contained within the database. This last  point  is  important:  all
  searching  for  overlaps  is between individual new gel readings and
  the data already in the database. There is no searching for overlaps
  between sequences within the database; overlaps must be found before
  new gel readings are entered into the database.

        Below  I  give  an  introduction  to  how  the  sequences  are
  processed by being passed from one function to the next.

        This program is used to start a database for the  project  and
  then the following procedure is used.

        Data in the form of individual gel readings are  entered  into
  the computer and stored in separate  files  using  either  this
  program or the digitizer program. Batches of these gel readings  are
  passed  to  the  screening  functions  in this program to search for
  overlaps with vector sequences  ("screen  against  vector")  or  for
  matches  to  restriction  enzyme  sites   that should not be present
  ("screen against enzymes"). Each run of  these  screening  functions
  passes  on  only  those  gel  readings  that do not contain unwanted
  sequences.  Sequences  are  passed  via  files  of  file  names  and
  eventually  are  processed by the automatic assembly function ("auto
  assemble"). This function compares each gel reading with a consensus
  of  all  the  previous  gel  readings stored in the database.  If it
  finds any overlaps it aligns the overlapping sequences by  inserting
  padding  characters,  and  then  adds  the  new  gel  reading to the
  database. Gels that overlap are added to existing contigs  and  gels
  that do not overlap any data in the database start new contigs. If a
  new gel overlaps two contigs they are joined. Any gel readings  that
  appear  to overlap but which cannot be aligned sufficiently well are
  not entered and have their names written to a  file  of  failed  gel
  reading names.

        Generally data is entered into the database in batches as just
  described.  The  program  is  also  used  to examine the data in the
  database, to enter gel readings that the automatic assembly function
  cannot  align  ("auto  assemble"), and to make final edits. Edits to
  whole contigs can  be made in several ways.  A  mouse-driven  editor
  ("edit   contig")   is   used   to   perform   all  edits  manually.
  Disagreements between gel readings in contigs  and  their  consensus
  sequences  can  be  highlighted  by  use  of the function "highlight
  disagreements".

        Editing the  sequences  is  obviously  an  essential  part  of
  managing   a  sequencing  project.  Editing  is  required  when  new
  sequences are added, when contigs are joined, and when sequences are
  corrected.   A  basic part of the strategy used here is that new gel
  readings should be correctly aligned throughout their  whole  length
  when  they  are entered into the database, and that when contigs are
  joined they are edited so that they are well aligned in  the  region
  of  overlap.  Alignment can be achieved by adding padding characters
  to the sequences, and this is the way "auto assemble" operates  when
  adding new sequences to the database.

        In order to search for overlaps that may have been missed  due
  to  errors  in the gel readings, the function "extract gel readings"
  can be used to take copies of  the  gel  readings  at  the  ends  of
  contigs,  and  write  them out as separate files.  These can then be
  compared with the  database  consensus  using  the  "auto  assemble"
  function in a mode that forbids entry of data into the database, and
  any gel reading matching two contigs will indicate a join  that  has
  been  missed.  The  joins can then be made interactively using "join
  contigs". Missed matches can be found  at  this  stage  because  the
  errors in the sequences may have been corrected by new data.

        Generally the users need not concern themselves with  how  the
  relational  information  is used by the program, but it is necessary
  to know how contigs are identified. Because contigs  are  constantly
  being  changed  and  reordered  the  program  identifies them by the
  numbers of the gel readings they contain.  Whenever  users  need  to
  identify  a  contig they need only know the number or name of one of
  the gel readings it contains. Whenever the  program  asks  users  to
  identify  a  contig  or  gel reading they can type its number or its
  archive name. If they type its archive name they  must  precede  the
  name by a slash "/" symbol to denote that it is a name rather than a
  number. E.g. if the archive name is fred.gel with number  99,  users
  should  type  /fred.gel  or  99  when  asked to identify the contig.
  Generally, when it asks for the gel reading to  be  identified,  the
  program  will  offer  the user a default name, and if the user types
  only return, that contig will be accessed. When a database is opened
  the  default  contig  will  be  the  longest  one, but if another is
  accessed, it will subsequently  become the current default.
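
        The identification rule described above (a number, a name
  preceded by a slash, or return alone for the default) might be
  sketched as follows. The function name and the returned tuples are
  hypothetical and are used only for illustration.

```python
def parse_identifier(reply: str, default_number: int):
    """Interpret a user's reply when asked to identify a contig or gel.

    Returns ("number", n) or ("name", archive_name).
    An empty reply (return only) selects the default.
    """
    reply = reply.strip()
    if not reply:
        return ("number", default_number)
    if reply.startswith("/"):
        # A leading slash marks an archive name rather than a number.
        return ("name", reply[1:])
    return ("number", int(reply))
```

  With this sketch, "/fred.gel" is read as a name and "99" as a number.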

        Further information is located in the  following  places.  The
  database  files  are described under "open database". The format for
  vector  and  consensus  sequences  is  given  under   "calculate   a
  consensus", as are the uncertainty codes used in gel readings.

        Two programs, other  than  this  one,  are  also  relevant  to
  sequencing: the digitizer program and  the  trace  editor  program.
  Both are outlined briefly below.

        The digitizer program is used for the  initial  input  of  gel
  readings  and  for  writing a file of file names. The program uses a
  digitizer for data entry. A digitizer is a two-dimensional  surface,
  such as a light box, on which the coordinates of a special  pen  are
  recorded by a computer whenever the pen is pressed onto it.
  These coordinates can be interpreted by a program.

        In order to read an autoradiograph placed on the light box the
  user  need  only  define the bottom of the four sequencing lanes and
  the bases to which they correspond and then use  the  pen  to  point
  to   each  successive   band  progressing  up  the gel.  The program
  examines the coordinates of each pen position to see in which of the
  four  lanes  it   lies  and  assigns  the  corresponding  base to be
  stored in the computer.  Each time the pen tip is depressed to point
  to  a  position on  the  surface of the digitizer the program sounds
  the bell on the terminal to indicate to the user that  a  point  has
  been recorded.  As the  sequence  is read the program displays it on
  the screen.
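
        As a rough illustration of the lane assignment just described,
  the sketch below assigns each pen position to the nearest lane. The
  function names, the use of only the x coordinate, and the
  nearest-lane rule are all assumptions made for this example; the
  digitizer program's actual method is not documented here.

```python
def assign_base(x: float, lane_positions: dict) -> str:
    """Return the base whose lane lies closest to pen coordinate x.

    lane_positions maps each base to the x coordinate the user gave
    for the bottom of that lane (an assumed representation).
    """
    return min(lane_positions, key=lambda base: abs(lane_positions[base] - x))


def read_gel(points, lane_positions) -> str:
    """Build a sequence from (x, y) pen positions taken up the gel."""
    return "".join(assign_base(x, lane_positions) for x, _y in points)
```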

        The trace editor program is used for the initial processing of
  data  obtained  from  fluorescent sequencing machines. It allows the
  user to visually select left and right cutoff  positions  to  denote
  the  start and end of good data. Users may also edit the sequence at
  this point.  Output from the trace editor (ted) is a  sequence  file
  in Staden format with headers that describe the  cutoff  information
  to xdap.
 @17. TX 1 @Screen against enzymes

        Used to compare gel readings against  any  restriction  enzyme
  recognition  sequences  that  may have been used  during cloning and
  which should not be  present  in  the  data.  Works  on  single  gel
  readings  or processes batches accessed through files of file names.
  The algorithm looks  for  exact  matches  to  recognition  sequences
  stored in a file.

        The  file  containing  the  recognition  sequences   must   be
  identified.  The  user  must choose between employing a file of file
  names, or typing in the names of individual gel reading files. If  a
  file  of  file names is used the program will also create a new file
  of file names. When the option has finished operating this new  file
  will  contain the names of all those gel readings that did not match
  any of the recognition sequences. Hence it can be used  for  further
  processing  of the batch. The recognition sequences should be stored
  in a simple text file with one recognition sequence per line.
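
        The screening step can be pictured with the following sketch,
  which returns the names of the gel readings that contain none of the
  recognition sequences (the names that would go into the new file of
  file names). The function name and the dictionary of readings are
  hypothetical. Searching only the strand as given, and ignoring case,
  are assumptions made for this example.

```python
def screen_against_enzymes(readings: dict, recognition_sequences) -> list:
    """Return names of gel readings with no exact recognition-site match.

    readings maps a gel reading name to its sequence.
    recognition_sequences holds one site per entry, as read from a
    text file with one recognition sequence per line.
    """
    sites = [s.strip().upper() for s in recognition_sequences if s.strip()]
    passed = []
    for name, sequence in readings.items():
        sequence = sequence.upper()
        if not any(site in sequence for site in sites):
            passed.append(name)
    return passed
```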
 @18. TX 1 @Screen against vector

        Used to compare gel readings against any vector sequences that
  may have been picked up during cloning. Works on single gel readings
  or processes batches accessed  through  files  of  file  names.  The
  algorithm  looks  for exact matches of length "minimum match length"
  and displays the overlapping sequences.

        The file containing the vector sequence  must  be  identified.
  The  user  must  choose  between  employing a file of file names, or
  typing in the names of individual gel reading files. If  a  file  of
  file  names  is used the program will also create a new file of file
  names. When the option has finished operating  this  new  file  will
  contain  the  names of all those gel readings that did not match the
  vector sequence. Hence it can be used for further processing of  the
  batch.  The  vector  sequence should be stored in a simple text file
  with up to 80 characters of data per line. More than one vector  can
  be  stored  in  a single file. If so each should be preceded by a 20
  character title of the form <---m13mp8.001-----> where the <  and  >
  signs  and  the  number like .001 are obligatory. The number must be
  preceded by a dot (.) and be 3 digits long. The  total  sequence  in
  the file must be < 50,001 characters in length.
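
        The title rules above can be checked with a sketch such as the
  one below. The function names are hypothetical. Counting only
  sequence characters (not titles) towards the 50,000 limit, and
  storing untitled data under an empty title, are assumptions made for
  this example.

```python
import re

# '<' ... '.' followed by exactly three digits ... '>'
_TITLE = re.compile(r"^<[^<>]*\.\d{3}(?!\d)[^<>]*>$")


def is_vector_title(line: str) -> bool:
    """True for a 20 character title such as <---m13mp8.001----->."""
    line = line.rstrip("\n")
    return len(line) == 20 and _TITLE.match(line) is not None


def split_vectors(lines) -> dict:
    """Group the lines of a vector file into {title: sequence}."""
    vectors = {}
    current = ""  # untitled data, as in a single-vector file
    for line in lines:
        line = line.rstrip("\n")
        if is_vector_title(line):
            current = line
            vectors.setdefault(current, "")
        elif line.strip():
            vectors[current] = vectors.get(current, "") + line.strip()
    if sum(len(s) for s in vectors.values()) > 50000:
        raise ValueError("total vector sequence exceeds 50,000 characters")
    return vectors
```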
 @20. TX 3 @Auto assemble

        Compares gel readings against  the  current  contents  of  the
  database  and  produces  alignments. In its normal mode of operation
  ("entry permitted"), the function will automatically enter  the  gel
  readings  into  the  database, but if entry is not permitted it will
  only  produce  alignments.  It  works  on  single  gel  readings  or
  processes  batches  of  gel  readings accessed through files of file
  names. It is the usual way to enter data into the database.

        The function will check the database for  logical  consistency
  and  will only proceed if it is OK. Choose if gel readings should be
  entered into the database, or  if  they  should  only  be  compared.
  Choose  between  using  a file of file names or typing file names on
  the keyboard. If so selected, supply the file of  file  names.  Also
  supply  a  file  of  file  names to contain the names of all the gel
  readings that fail to get entered. Select  the  entry  mode.  Normal
  assembly  is  appropriate  for  all but special cases, as is "permit
  joins". Uses for the other modes are not documented here.  Define  a
  minimum  initial match length. Define a minimum alignment block (the
  default value is taken in all but exceptional circumstances). Define
  the  maximum number of padding characters allowed to be used in each
  gel reading to help achieve alignment, and the same for  the  number
  allowed  in  the  contig  for  each  gel reading. Finally define the
  maximum percentage mismatch to be allowed for any gel reading to  be
  entered into the database. If, for any gel reading,  any  of  these
  last three values is exceeded the gel reading will  not  be  entered
  into the database.
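
        The three limits act together as a simple acceptance test,
  which might be sketched as follows. The function and parameter names
  are hypothetical; the default values are those offered in the
  dialogue shown later in this section.

```python
def entry_permitted(pads_in_gel: int, pads_in_contig: int,
                    percent_mismatch: float,
                    max_pads_gel: int = 8,
                    max_pads_contig: int = 8,
                    max_mismatch: float = 8.0) -> bool:
    """True if none of the three limits is exceeded for a gel reading."""
    return (pads_in_gel <= max_pads_gel
            and pads_in_contig <= max_pads_contig
            and percent_mismatch <= max_mismatch)
```

  For the alignment in the sample output (0 pads in the contig, 1 in
  the gel, 1.8 percent mismatch) this test is satisfied.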

        In operation the  function  takes  a  batch  of  gel  readings
  (probably   passed  on   as   a  file  of file names from one of the
  screening routines) and enters them into a database for a sequencing
  project.  It takes each gel reading in turn and compares it with the
  current consensus for the database. It then produces  an   alignment
  for   any   regions   of   the   consensus   it overlaps;   if  this
  alignment is sufficiently good  it  then  edits  both  the  new  gel
  reading  and  the  sequences  it  overlaps   and   adds the new  gel
  reading to the database.  The program  then  updates  the  consensus
  accordingly and carries on to the next  gel  reading.

        All alignments are displayed and  any  gel  readings  that  do
  match but  that cannot be aligned sufficiently well have their names
  written to a file of failed gel reading names.  The  function  works
  without   any  user  intervention  and can process any number of gel
  readings in a single run.  Those  gel  readings  that  fail  can  be
  recompared  using  the  same  function  (to find the current overlap
  position) and  the user  can enter them into the  database  manually
  using  the   "enter new gel reading" option.

        Typical dialogue and output from the function is shown  below.
  (Note  that  output  for gel readings 2 - 9 has been deleted to save
  space).
  Automatic sequence assembler
  Database is logically consistent
  ? (y/n) (y) Permit entry
  ? (y/n) (y) Use file of file names
  ? File of gel reading names=demo.nam
  ? File for names of failures=demo.fail
  Select entry mode
  X  1 Perform normal shotgun assembly
     2 Put all sequences in one contig
     3 Put all sequences in new contigs
  ? Selection  (1-3) (1) =
  ? (y/n) (y) Permit joins
  ? Minimum initial match (12-4097) (15) =
  ? Minimum alignment block (2-5) (3) =
  ? Maximum pads per gel (0-25) (8) =
  ? Maximum pads per gel in contig (0-25) (8) =
  ? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing           1 in batch
    Gel reading name=HINW.004
    Gel reading length=   283
    Searching for overlaps
    Strand     1
    Strand     2
    No matches found
    Total matches found           1
    Padding in contig=    0 and in gel=    1
    Percentage mismatch after alignment =  1.8
    Best alignment found
           1         11         21         31         41         51
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           ********** ********** ********** ********** ********** **********
           TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
           1         11         21         31         41         51
          61         71         81         91        101        111
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
           ********** ********** ********** ********** ********** ***** ****
           TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
          61         71         81         91        101        111
         121        131        141        151        161        171
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
           ********** ********** ********** ********** ********** **********
           CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
         121        131        141        151        161        171
         181        191        201        211        221        231
           GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
           ********** ********** * ******** ********** ********** **********
           GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
         181        191        201        211        221        231
         241        251        261        271        281
           CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
           ****  **** ********** ** * ***** ********** ****
           CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
         241        251        261        271        281
            Batch finished
            9 sequences processed
            0 sequences entered into database
            0 joins made


        Note that "auto assemble" cannot align protein sequences.
 @28. TX 1 @Highlight disagreements

        Used  in  the  latter  stages  of  a  project   to   highlight
  disagreements  between  individual  gel readings and their consensus
  sequences. Characters that agree with the consensus are shown  as  :
  symbols  for  the plus strand and . for the minus strand. Characters
  that disagree with the consensus are left unchanged and so stand out
  clearly. The results of this analysis are written to a file.
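
        The replacement rule can be sketched for a single aligned gel
  reading as follows. The function name is hypothetical, and it is
  assumed that the reading and the consensus are supplied as strings
  of equal length, with blanks where the reading does not cover the
  consensus.

```python
def highlight(reading: str, consensus: str, minus_strand: bool = False,
              plus_symbol: str = ":", minus_symbol: str = ".") -> str:
    """Replace characters that agree with the consensus by a symbol.

    Disagreeing characters are left unchanged so that they stand out;
    blank positions (no data from this reading) stay blank.
    """
    symbol = minus_symbol if minus_strand else plus_symbol
    out = []
    for r, c in zip(reading, consensus):
        if r == " ":
            out.append(" ")
        elif r == c:
            out.append(symbol)
        else:
            out.append(r)
    return "".join(out)
```

  For example, highlight("GCTATT", "G-TATT") gives ":C::::".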

        Before selecting this option create a file of the  display  of
  the  contig to be "highlighted". The option will ask for the name of
  this file. Select symbols to denote "agreeing"  characters  on  each
  strand, the defaults are : and ., but any others can be used. Supply
  the name of a file in which to put the output.

        The display file needed as input for this option is created by
  selecting  "Redirect  output",   followed  immediately  by  "display
  contig", and then "Redirect output" again. The cutoff score used  in
  the  consensus  calculation  can  be  set  by  option  "set  display
  parameters". Note that for the highlight function there is  a  limit
  of  50  for  the  number  of  gel  readings  that are aligned at any
  position - i.e. the contig must be less than 51 gel readings deep at
  its  thickest point. I hope that those performing shotgun sequencing
  never reach this limit, but those using the  program  for  comparing
  sequence families might.

        Typical output from this function is shown below.

                            210       220       230       240       250
      1  HINW.004    :C::::::::::::::::::::::::::::::::::::::::::AC::::
      7  HINW.018    :*::::::::::::::::::::::::::::::::::::::::::CA::::
     -4  HINW.017                                 ...............AC....
                     G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG

                            260       270       280       290       300
      1  HINW.004    ::::::::::::*:D:::::::::::::::::::
      7  HINW.018    ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
     -4  HINW.017    ..............................................A...
      3  HINW.009    :::::::::::::::V::::::::::::::::::::::::::::*AV:::
     -6  HINW.028                            ......................A...
                     AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG

 @32. TX 3 @Extract gel readings

        Used to make copies of the aligned gel readings in a database,
  to write them into separate files, and to write a corresponding file
  of file names. It operates in two modes: either all gel readings are
  extracted, or only those at the ends of contigs.

        Choose which mode of operation is required and supply  a  file
  of file names.

        The gel readings are given their original names.  If  used  to
  extract  the  gel  readings from the ends of contigs the function is
  useful for checking for missed contig joins: the file of file  names
  can  be  used with the auto assemble function to recompare these gel
  readings, and each should only overlap one contig. Any that  overlap
  two contigs will identify possible joins.

        If the option is used to extract all the gel readings  from  a
  database,  a  subsequent  run  of "auto assemble" can reconstitute a
  database which has  been  corrupted.  This   rarely  occurs  and  is
  usually  necessitated  by  a  user   employing "alter relationships"
  incorrectly without first having made a copy.
 @1. TX 0 @Help

        Help is available on the following topics :
 @2. TX 0 @Quit

        This command stops the program and is the  only  safe  way  to
  terminate  a run of the program that has altered the contents of the
  database in any way.
 @3. TX 1 @Open a database

        Opens existing databases or allows new ones to be started. The
  function  is automatically called into operation when the program is
  started but can also be selected from the general menu.

        Choose to open an existing database or start a new one, or  if
  !  is  typed  when  the  program is first started, enter the program
  without opening a database. Supply a project database name,  and  if
  it  already exists, the "version". If starting a new database define
  the database size and if it is for DNA or  protein  sequences.   The
  database  size  is  an  initial  size  for  the  database. It can be
  increased later during the project. It is the sum of the  number  of
  gel readings plus the number of contigs.

        Database names can have from one to 12 letters  and  must  not
  include a full stop (.). The database is made  from  five  separate
  files. If the database is called FRED then  version  0  of  database
  FRED  comprises  files  FRED.AR0,  FRED.RL0,  FRED.SQ0, FRED.TG0 and
  FRED.CC0. The version is the last symbol in the  file  names.   Only
  this  program can read these files. If the "copy database" option is
  used it will ask the user to define a new "version".
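
        The naming scheme above can be sketched in a few lines of
  Python. This is only an illustration and not part of the program:
  the function name database_files is hypothetical, and reading
  "letters" as A-Z only is an assumption.

```python
import re

# Suffixes of the five database files, in the order listed in the text.
SUFFIXES = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names for one version of a project database."""
    # One to 12 letters and no full stop; A-Z only is an assumption.
    if not re.fullmatch(r"[A-Za-z]{1,12}", name):
        raise ValueError("database name must be 1 to 12 letters")
    # The version is the last symbol of each file name, so one character.
    version = str(version)
    if len(version) != 1:
        raise ValueError("version must be a single symbol")
    return [f"{name}.{suffix}{version}" for suffix in SUFFIXES]
```

  For database FRED, version 0, this returns FRED.AR0, FRED.RL0,
  FRED.SQ0, FRED.TG0 and FRED.CC0.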

        For normal use the maximum gel reading length is  set  to  512
  characters,  but  when  a  database  is  started the user may choose
  lengths of either 512, 1024, 1536..., 4096. Normally the program  is
  used  to handle DNA sequences but many of the functions also work on
  protein sequences. The choice of sequence  type  is  made  when  the
  database is started.

        The contigs are not stored on the disk as the user  sees  them
  displayed  on the screen. Each gel reading is stored with sufficient
  information about how it overlaps other gel  readings  so  that  the
  program  can  work out how to present them aligned on the screen. We
  refer to this extra data as "the relationships" and it is  explained
  below.  The database comprises 5 separate files.
  1.  a working version of each gel reading.  This is the  version  of
  the   gel   reading  that  is in the database and initially it is an
  exact copy of the original sequence (known as the archive) but it is
  edited and manipulated to align  it with other gel readings.
  2.  the file of  relationships.   This  file  contains  all  of  the
  information  that  is required to assemble the working versions into
  contigs during processing;  any manipulations on the data  use  this
  file   and  it  is  automatically  updated  at  any  time  that  the
  relationships are changed.  The  information  in  this  file  is  as
  follows:
  (A) Facts about  each   gel  reading   and   its   relationship   to
  others ("gel descriptor lines"):
  (a) the number of the gel reading   (each gel reading   is  given  a
  number  as  it  is entered into the database)
  (b) the length of the sequence from this gel reading
  (c) the position of the left end of this gel reading    relative  to
  the left end of the contig of which it is a member
  (d) the number of the next gel reading   to the  left  of  this  gel
  reading
  (e) the number of the next gel reading   to the right
  (f) the relative strandedness of this gel reading  , ie whether   it
  is  in the same sense or the complementary sense as its archive.
  (B) Facts about each contig ("contig descriptor lines"):
  (a) the length of this contig
  (b) the number of the leftmost gel reading   of this contig
  (c) the number of the rightmost gel reading   of this contig.
  (C) General facts:
  (a) the number of gel readings in the database
  (b) the number of contigs in the database.
  3.  the file of archive names.  This is simply a list of  the  names
  of each of the archive files in the database, but on line number 1000
  we also store the size of the database, ie the number of  lines  of
  information allowed in the database files. This file always has 1000
  lines but the length of the file of relationships and  the  file  of
  working  versions can be set by the user when creating a database or
  when copying from one to another.
  4. the file of tags (annotation). This consists of linked  lists  of
  tag  information  for  each  sequence  in  the  database.  Tags  are
  created by the user as annotation, or by xdap as records of edits or
  for  storing  cutoff  information.   As  the number of tags can grow
  without limit, so can this file.  For each gel  there  is  a  header
  record,  which contains the record number of the start of the linked
  list for that gel. On line  IDBSIZ  there  is  a  record  containing
  information  about  the  file  such as its present length and whether
  there are any free "tag" slots to be reused in the file.
  5. the file of comments (annotation). This consists of linked  lists
  of comment fragments.  Comments are created by the user as a message
  attached  to  annotation,  or  by  the  system  to   store   cutoff
  information.  Comments are character strings of any length. Comments
  longer than 40 characters are broken  up  into  fragments,  each  40
  characters  long,  and are chained together in a linked list. As the
  number of comments can grow without limit, so can this file.

        Structure of the database files

        1.  The file of relationships

        The file contains IDBSIZ lines of data:  the general data  are
  stored  on line IDBSIZ;   data about  gel readings  are stored  from
  line 1 downwards;  data about contigs are stored from line  IDBSIZ-1
  upwards.  A  database  of 500 lines containing 25 gel readings and 4
  contigs would have a file of relationships as is shown below.


                    ---------------------------------------------
                       1  Gel descriptor record
                       2   "      "       "
                       3   "      "       "
                       4   "      "       "
                       5   "      "       "
                       '   '      '       '
                       '   '      '       '
                      25   "      "       "
                      26  Empty record
                       '    '     '

                       '    '     '
                     495    '     '
                     496  Contig descriptor record
                     497    "        "        "
                     498    "        "        "
                     499    "        "        "
                     500   Number of gel readings=25, Number of contigs=4
                    ---------------------------------------------

            The arrangement of the data in the file of relationships

  As each new gel reading   is added into the database a new  line  is
  added to  the  end  of  the  list  of gel descriptor lines.  If this
  new gel  reading  does not overlap with any gel readings already  in
  the  database  a new contig  line  is added  to  the top of the list
  of contig lines.  If it overlaps with one contig then no new  contig
  line  need  be  added  but  if it  overlaps with  two  contigs  then
  these  two  contigs must be joined and the number  of  contig  lines
  will  be reduced by one. Then the list of contig lines is compressed
  to  leave  the empty line at the top of the list.  Initially the two
  types  of  line will move towards  one  another  but eventually,  as
  contigs  are joined, the contig descriptor lines will  move  in  the
  same  direction  as the  gel descriptor lines.   At  the  end  of  a
  project  there should  be only one contig  line.   The  database  is
  thus capable of handling a project of 998 gels.
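
        As an illustration of the layout just described, the sketch
  below works out which lines of the file of relationships are in use.
  It is not the program's own code; the function name and the returned
  dictionary are hypothetical. Ranges are inclusive, and a range whose
  end is smaller than its start is empty.

```python
def relationships_layout(idbsiz, n_gels, n_contigs):
    """Return the line ranges used in a file of relationships.

    Lines are numbered from 1, as in the diagram above.
    """
    # Line IDBSIZ holds the general data, so IDBSIZ-1 lines remain
    # for gel descriptor lines and contig descriptor lines together.
    if n_gels + n_contigs > idbsiz - 1:
        raise ValueError("database is full")
    first_contig = idbsiz - n_contigs
    return {
        "gel_lines": (1, n_gels),                       # from line 1 downwards
        "empty_lines": (n_gels + 1, first_contig - 1),  # unused middle
        "contig_lines": (first_contig, idbsiz - 1),     # from IDBSIZ-1 upwards
        "general_line": idbsiz,
    }
```

  For the example above (500 lines, 25 gel readings, 4 contigs) the
  contig lines come out as 496 to 499 and the empty lines as 26 to
  495, which agrees with the diagram.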

        2.  Structure of the working versions file

        The working versions of gel readings are stored  in   a   file
  of  IDBSIZ lines each containing 512 characters.  Gel reading number
  1 is stored on line 1, gel reading number  2 on line 2 and so on.

        3.  Structure of the archive names file

        This file, unlike the others, always has 1000  lines  each  10
  characters  in length. Its length is fixed because line 1000 is used
  to store IDBSIZ the database size and the programs need  a  definite
  location from which to read this number.

        4.  Structure of the tag file

        This file initially starts with IDBSIZ lines, and is  expanded
  as  new tags are created.  Information about the length of the file,
  and which tag records are reusable is  stored  on  line  IDBSIZ.   A
  database of 500 lines would have a file of tags as shown below.

                    ---------------------------------------------
                       1  Tag descriptor record
                       2   "      "       "
                       3   "      "       "
                       4   "      "       "
                       5   "      "       "
                       '   '      '       '
                       '   '      '       '
                     497   "      "       "
                     498   "      "       "
                     499   "      "       "
                     500   Length of file=N, Free list=0
                     501  Tag record
                     502   "   "
                     503   "   "
                       '   '   '
                       '   '   '
                     N-2   "   "
                     N-1   "   "
                       N  Tag record
                    ---------------------------------------------

                 The arrangement of the data in the file of tags

  As each new tag is added to the database, a check  is  made  in  the
  file  descriptor  record  at  line  IDBSIZ.  If the list of reusable
  records is 0, the file is extended by one line.  Otherwise  the  new
  tag is assigned to the record at the head of the free list. When tags
  are deleted, they are added to the free list in the file  descriptor
  record.
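
        The allocation rule above is an ordinary free list. A minimal
  in-memory sketch follows. The class TagFile and its field names are
  hypothetical, and how the real file links one free record to the
  next is not described here, so a dictionary stands in for that link.

```python
class TagFile:
    """In-memory model of the record allocation described above."""

    def __init__(self, idbsiz):
        self.length = idbsiz   # records 1..IDBSIZ exist from the start
        self.free_head = 0     # 0 means there are no reusable records
        self.next_free = {}    # stand-in for the link stored in a freed record

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free_head == 0:
            self.length += 1   # extend the file by one line
            return self.length
        record = self.free_head            # reuse the head of the free list
        self.free_head = self.next_free.pop(record)
        return record

    def release(self, record):
        """Put the record of a deleted tag at the head of the free list."""
        self.next_free[record] = self.free_head
        self.free_head = record
```

  With a 500 line database the first two tags get records 501 and 502.
  If the tag in record 501 is then deleted, the next new tag reuses
  record 501 and the one after that extends the file to record 503.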

        5.  Structure of the comment file

        This file initially starts with 1 line, and is expanded as new
  annotation  is  created.   Information about the length of the file,
  and which comment records are reusable is stored on the first line.

                    ---------------------------------------------
                       1  Length of file=N, Free list=0
                       2  Comment fragment
                       3   "       "
                       4   "       "
                       '   '       '
                       '   '       '
                     N-2   "       "
                     N-1   "       "
                       N  Comment fragment
                    ---------------------------------------------

               The arrangement of the data in the file of comments

  As each new comment is added to the database, a check is made in the
  file descriptor record at line 1. If the list of reusable records is
  0, the file is extended to hold the new comment. Otherwise  the  new
  comment  is  assigned  to  records  starting  with  the head of the
  free list.  When comments are deleted,  the  discarded  records  are
  added to the free list in the file descriptor record.
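
        The splitting of a long comment into 40 character fragments
  can be sketched as follows. The function names are hypothetical, and
  the use of consecutive record numbers with 0 marking the end of a
  chain is an assumption made only for this illustration.

```python
FRAGMENT_LENGTH = 40  # fragment size given in the text


def split_comment(comment):
    """Break a comment into fragments of at most FRAGMENT_LENGTH characters."""
    pieces = [comment[i:i + FRAGMENT_LENGTH]
              for i in range(0, len(comment), FRAGMENT_LENGTH)]
    return pieces or [""]  # an empty comment still occupies one fragment


def chain_fragments(fragments, first_record):
    """Link fragments into a chain of records.

    Returns {record: (text, next_record)}. Consecutive record numbers and
    a next_record of 0 at the end of the chain are assumptions.
    """
    chain = {}
    for offset, text in enumerate(fragments):
        record = first_record + offset
        is_last = offset == len(fragments) - 1
        chain[record] = (text, 0 if is_last else record + 1)
    return chain
```

  A comment of 95 characters gives three fragments of 40, 40 and 15
  characters, and joining the fragments in chain order gives back the
  original comment.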

        There  are  various  checks  within  the  programs  to protect
  users from themselves:-
  1.  All user input is checked for  errors  -  e.g.    reference   to
  non-existent  gel readings or  contigs,  incorrect  positions in the
  contig or gel readings.
  2.  Before entering a gel reading the system checks to see if a file
  of the same name has already been entered.
  3.  Join will not allow the circularising of a contig.
  4.  Both enter and join functions restrict the region that the  user
  is allowed to edit (using edit contig) to the region of overlap.
  5. Users may escape from any point in the program.
  6. Help is available from all points in the program.


  IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
  ANYTHING THAT  INVOLVES  CHANGING THE CONTENTS OF THE DATABASE, I.E.
  DURING AUTO ASSEMBLE,  COMPLETE  ENTRY,  COMPLETE  JOIN,  COMPLEMENT
  CONTIG,  EDIT  CONTIG,  AND  SCREEN  EDIT.   This  could corrupt the
  database so badly that it is impossible to fix. The  program  should
  always be left using the QUIT option.
 @4. TX 3 @Edit contig

        The Contig Editor is a mouse-driven editor  that  can  insert,
  delete and change gel reading sequences.

        The Contig Editor allows scrolling from one end of a contig to
  the  other  using the scroll bar and scroll buttons. Action of mouse
  button presses when the mouse pointer is in the scroll bar:

      Middle Mouse Button      Set editor position
      Left   Mouse Button      Scroll forward one screenful
      Right  Mouse Button      Scroll backwards one screenful

  The four scroll buttons operate as follows:

      "<<"                     Scroll left half a screenful
      "<"                      Scroll left one character
      ">"                      Scroll right one character
      ">>"                     Scroll right half a screenful

        The Editor cursor can  be  positioned  anywhere  in  the  edit
  window  by  moving the mouse pointer over the character of interest,
  then pressing the left mouse button. The Editor cursor can  also  be
  moved by using the direction arrow keys.

        The editor operates in two  main  edit  modes  -  Replace  and
  Insert. Replace allows a character to be replaced by another. Insert
  allows characters to  be  inserted  into  a  gel  reading  sequence.
  Characters  are entered by typing them from the keyboard. Only valid
  characters are permitted.  Characters can be deleted by  positioning
  the cursor one character to the right, then pressing the delete key.
  Normally Insert and Delete apply to the consensus line of the contig
  ONLY.  This  restraint  can  be overridden by using the "Super Edit"
  mode of operation, THOUGH IT IS NOT RECOMMENDED.

        Edits can also be performed on the consensus, though they  are
  restricted  to  insertion  and deletion of padding characters ("*").
  These edits also have special meanings.  A deletion will delete  ALL
  characters  at the position to the left of the cursor in the contig,
  and move the relative positions of all  sequences  starting  to  the
  right  of the cursor position left one character.  An insertion will
  insert the character typed ("*") into ALL gel reading  sequences  at
  the cursor's position in the contig, and move the relative positions
  of all sequences starting to the right of the cursor position  right
  one character.

        The effect of the last edit can  be  undone  by  pressing  the
  "Undo" button at the top of the editor window.

        The cursor  will  automatically  be  positioned  at  the  next
  problem  when  the  "Find Next Problem" button is selected. The next
  problem is where the consensus shows either an ambiguity ("-") or  a
  pad ("*") character.

        The edits to the contig can be saved by  pressing  the  "Leave
  Editor"  button and replying "Yes" to the prompt to "Save changes?".
  As no changes are made to the working copy of your database until this
  point it is possible to abort the editor if the edit session ends up
  in an unsatisfactory state (ie if you've stuffed it up!)



 Displaying Traces

        The original data from which the gel reading  sequences  were
  derived  can  be seen by double clicking (two quick clicks) with the
  middle mouse button on the area  of  interest.  The  trace  will  be
  displayed  with  the  point  clicked  at  the  centre  of  the trace
  viewport.

        All traces that are displayed are maintained  in  one  window,
  called  the  Trace Manager. The Trace Manager will only display four
  traces maximum. When four traces are already being managed and a new
  one is requested, the one at the top of the Trace Manager is removed
  and the new one is added to  the  bottom.   Traces  can  be  removed
  individually  by  using  the  "quit" button in the panel next to the
  trace.
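
        The behaviour of the Trace Manager matches a bounded queue. A
  small sketch is given below. The function names are hypothetical,
  and what happens when a trace that is already displayed is requested
  again is not stated above, so the sketch does not treat that case
  specially.

```python
from collections import deque

MAX_TRACES = 4  # limit stated in the text

# The left end of the deque is the top of the Trace Manager.
traces = deque(maxlen=MAX_TRACES)


def show_trace(name):
    """Add a trace at the bottom; the top one is dropped when four are shown."""
    traces.append(name)


def quit_trace(name):
    """Remove one trace, as the "quit" button next to it does."""
    traces.remove(name)
```

  After five traces a, b, c, d and e have been requested in that
  order, the manager holds b, c, d and e.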



 Extending Reads Using Cutoff Information

        Sequence data read in from automated  fluorescent  sequencing
  machine trace files that were processed through the program ted  has
  its discarded sequence (vector at the start and poor  data  at  the
  end)  available  to  the  contig  editor.  To  display  the  cutoff
  information, press the
  "Display Cutoff" button at the top of the editor window.  The cutoff
  sequence appears in grey. This sequence can be incorporated into the
  editable sequence, by moving the cutoff position. This  is  done  by
  positioning  the  cursor  at  the end of the gel sequence, and using
  Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of  cutoff.
  The Meta key is a diamond on the Sun keyboard.



 Pop-up menu

        A pop-up menu is revealed by depressing the "Control"  key  on
  the  keyboard  and  at the same time pressing the left mouse button.
  The menu has the following functions:

      Search
      Save Contig
      Create Tag
      Edit Tag
      Delete Tag

  "Save Contig" is described above.  Searching and operations on  tags
  are described below.



 Searching

        Selecting "Search" brings up a window which can remain present
  during normal editor operation. The window allows the user to select
  the direction of search, the type of search and a  value  to  search
  on.   The value is entered into the value text window. Then pressing
  the "search" button performs the search. If successful,  the  cursor
  is  positioned  and  centred  accordingly. An audible tone indicates
  failure.  Pressing the "ok" button removes the  search  window.  The
  search  window  is  automatically  removed when the contig editor is
  exited.

  There are seven different search modes:

  1. Search by position

  This positions the cursor at the numeric position specified  in  the
  value  text  window.  Eg  a  value of "1234" causes the cursor to be
  placed at base number 1234 in the contig. Positioning within  a  gel
  reading  is achieved by prefixing the number with the "@" character,
  eg "@123" positions the cursor at base 123 of the sequence in  which
  the  cursor  lies.  Relative positions can be specified by prefixing
  the number with a plus or minus character. Eg "+1234"  will  advance
  the  cursor 1234 bases. If possible, the cursor is positioned within
  the same sequence.  The direction buttons  have  no  effect  on  the
  operation of "search by position".
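
        The three forms of value can be told apart by their first
  character. The sketch below shows one way to do this; the function
  name and the labels "contig", "reading" and "relative" are
  hypothetical and are not taken from the program.

```python
def parse_position(value):
    """Classify a "search by position" value.

    Returns (kind, number), where kind is "contig" for a position in the
    contig, "reading" for a position within the current gel reading, and
    "relative" for a signed offset from the cursor.
    """
    value = value.strip()
    if value.startswith("@"):
        return ("reading", int(value[1:]))
    if value[:1] in ("+", "-"):
        return ("relative", int(value))  # int() accepts a leading sign
    return ("contig", int(value))
```

  A value that is not a number raises ValueError in this sketch; the
  editor itself signals a failed search with an audible tone.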

  2. Search by reading name

  This positions the cursor  at  the  left  end  of  the  gel  reading
  specified  in the value text window. If the value is prefixed with a
  slash it is assumed to be  a  gel  reading  name.  Otherwise  it  is
  assumed to be a gel reading number. Eg "123" positions the cursor at
  the left end of gel reading number 123.  "/a16a12.s1"  positions  at
  the  start  of reading a16a12.s1. If the value was "/a16" the cursor
  is positioned at the first reading which  starts  with  "a16".   The
  direction  buttons  have  no  effect  on the operation of "search by
  reading name".
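
        The distinction between a name and a number can be sketched as
  below. The function name and the list of (number, name) pairs are
  hypothetical. The order in which readings are tried when a prefix
  matches several of them is not stated above, so the sketch simply
  uses the order of the list it is given.

```python
def find_reading(value, readings):
    """Resolve a "search by reading name" value to a gel reading number.

    readings is a list of (number, name) pairs in the order to be
    searched (a hypothetical structure). Returns None if nothing matches.
    """
    if value.startswith("/"):
        prefix = value[1:]
        # A leading slash means a name; the first name with this prefix wins.
        for number, name in readings:
            if name.startswith(prefix):
                return number
        return None
    # Without a slash the value is taken to be a gel reading number.
    wanted = int(value)
    for number, _name in readings:
        if number == wanted:
            return number
    return None
```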

  3. Search by tag type.

  This positions the cursor at the start of the next tag which has the
  same type as specified by the type value menu. To change  the  type,
  select  from  the menu that pops up when the mouse is clicked on the
  button labeled "Type:". The search can be performed either  forwards
  or  backwards  from  the  current  cursor  position.   To  find  all
  tags, use "search by annotation", with a null text value string.

  4. Search by annotation.

  This positions the cursor at the start of the next tag which  has  a
  comment  containing  the  string specified in the value text window.
  The search performed is a regular  expression  search,  and  certain
  characters  have  special meaning. Be careful when your value string
  contains ".", "*", "[", "^" or "$".  The  search  can  be  performed
  either forwards or backwards from the current cursor position.
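
        The effect of the special characters can be seen in a small
  example. Python's re module is used here purely as a stand-in: the
  regular expression syntax accepted by the program itself is not
  specified in this help, and the comments searched are made up.

```python
import re

# Made-up tag comments, for illustration only.
comments = ["primer for walking", "compression at 3.5 kb", "check 3x5 region"]

# Unescaped, "." matches any character, so "3.5" also matches "3x5".
loose = [c for c in comments if re.search("3.5", c)]

# Escaping the value makes every character in it literal.
exact = [c for c in comments if re.search(re.escape("3.5"), c)]

# An empty pattern matches every comment.
everything = [c for c in comments if re.search("", c)]
```

  Here loose holds the last two comments, exact holds only the one
  containing "3.5", and everything holds all three.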

  5. Search by sequence.

  This positions the cursor at the start of the next piece of sequence
  that  matches  the  value  specified  in  the text value window. The
  search is for an exact match, which means the case of the value string
  is   important.   The  search  is  performed  on  the  gel  readings
  themselves, rather than the consensus sequence. The  search  can  be
  performed  either  forwards  or  backwards  from  the current cursor
  position.

  6. Search by problem.

  This positions the  cursor  at  the  next  place  in  the  consensus
  sequence  which  is  not  an "A", "C", "G" or "T". The search can be
  performed either forwards  or  backwards  from  the  current  cursor
  position.

  7. Search by quality

  This positions the  cursor  at  the  next  place  in  the  consensus
  sequence  where the consensus calculation for each strand disagrees.
  When only sequences on one strand are present, the search will  stop
  at  every  base.  The  search  can  be  performed either forwards or
  backwards from the current cursor position.



 Annotation

        Parts of a sequence can be annotated, to record the  positions
  of  primers used for walking, or to mark sites, such as compressions
  that have caused problems during sequencing.  The consensus sequence
  CANNOT be annotated.

        To annotate a piece of  sequence  first  select  the  part  of
  sequence  using  the  mouse  buttons.  Use  the left mouse button to
  position the start of the selection, and while this button is  being
  held  down, move the mouse to extend.  The selection can be extended
  further using the right mouse button.

        To create annotation, invoke the pop-up menu, and  select  the
  "Create Tag" function. A small "tag editor" will appear which allows
  you to select the type of the annotation from a pull-down menu,  and
  specify  a  comment  if desired.  To select a new type pull down the
  Type menu, and select the entry desired.  To enter a comment, simply
  type  into  the  text  window  in the tag editor.  The annotation is
  created when the "Leave" button on the tag editor is pressed, and is
  displayed in the colour defined in the tag database file (TAGDB).

        To edit existing annotation, position the cursor with the left
  mouse  button  on  the tag, and select the "Edit Tag" off the pop-up
  menu.  This invokes the tag editor, and  changes  to  the  type  and
  comment  of  the annotation can be made. The tag is updated when the
  "Leave" button is pressed.

        To delete an existing annotation, position the cursor with the
  left  mouse  button  on the tag, and select the "Delete Tag" off the
  pop-up menu.



 NOTE:

        As the Contig Editor is a very powerful tool, it  is  possible
  that  the  alignment  of  the gel reading sequences has unexpectedly
  been disrupted.  This can easily happen to parts of the contig  that
  lie to the right of the screen if excessive use has been made of the
  "Super Edit" facility.  Until familiar with "Super  Edit"  it  would
  benefit  the  sequencer  to  quickly  scan  through the contig after
  editing to check that bad alignments have not been created.
 @9. T 3 @Screen edit

        THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE EDIT CONTIG

        Gives access to the system editor on the machine (for  example
  EDT  on  a  VAX)  and  allows users to edit contigs. The contigs are
  presented as for "display contig" and the program will  reconstitute
  the contig's sequences and relationships  when the editor is exited.

        To screen edit a contig set the line length to 50  characters,
  select  the  contig to edit, and supply the name of a temporary file
  in which the editing will be performed.  After  a  short  pause  the
  system editor will present the first page of the file. Edit the file
  obeying the rules given below. Exit from the editor and  affirm  the
  intention  of returning the contig to the database. The program will
  put the contig back into the database.

        Rules for screen editing

        There are some limitations on the changes that can be made  to
  the contigs when using the screen editor. Users are unlikely to want
  to break the rules in order  to  achieve  changes  to  contigs,  but
  nevertheless  the  constraints need to be defined and they are given
  below.

        Alignments must be maintained during editing.  Whole lines  of
  sequence  should not be deleted or added unless the order of the gel
  readings in the contig  is  preserved.   Each  line  in  the  contig
  display  consists  of  gel  reading  numbers,  their  names  and  50
  character sections  of  sequence.  Insertions  are  limited  in  the
  following  way.  No line of sequence can be extended rightwards more
  than 10 characters beyond the end of a  full  length  line  (a  full
  length  line is 50 characters long). Only one character can be added
  to the left end of full  length  lines,  but  sections  of  sequence
  beginning  further  into  a  line can be extended leftwards up to an
  equivalent position. Do not delete any  non-sequence  lines  in  the
  file.

        Before returning the contig to the database the program checks
  that  the rules have been obeyed. If an error is found the number of
  the erroneous line in the file is displayed and the contig will  not
  be changed.
 @5. TX 1 @Display a contig

        Used to show the aligned  gel  readings  for  any  part  of  a
  contig.  The  number,  name  and strandedness of each gel reading is
  shown and the consensus is written below.

        If required identify the contig,  and then the start  and  end
  points of the region to display.

        The display can be directed  to  a  disk  file  using  "direct
  output to disk".  These files are required by options: "screen edit"
  and "highlight disagreements", and printed copies of them  are  very
  useful for marking corrections prior to using the editors.

        Below is an example showing the left  end  of  a  contig  from
  position  1  to  200.  Overlapping this region are gels 6, 3, 5, 17 and
  12; 6, 3 and 5 are in reverse orientation to their archives (denoted
  by  a  minus  sign).  There  are  a  few uncertainty codes and a few
  padding characters  in  the  working  versions,  but  the  consensus
  (shown  below  each page width) has a definite assignment for almost
  every position.

                             10        20        30        40        50
     -6  HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
         CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                            110       120       130       140       150
     -6  HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
     -3  HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
     -5  HINW.009                                        GGCAGCGACCGTCG
     17  HINW.999                                           AGCGACCGTCG
         CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                            160       170       180       190       200
     -6  HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
     -3  HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
     -5  HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
     17  HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     12  HINW.017                                              GTAGAGC*
         CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
 @6. TX 1 @List a text file

        This option allows users to list text files on the screen.  It
  can  be  used  to  read  a file containing notes, for checking files
  written to disk etc. The user is asked to type the name of the  file
  to list.
 @8. TX 1 @Calculate a consensus

        Calculates  a  consensus  sequence   either  for   the   whole
  database or for selected contigs. The consensus is written to a file
  named by the user.
  Supply a file name,  choose  between   whole  database  or  selected
  contigs.

        Symbols for uncertainty in gel readings

        In  order  to  record  uncertainties  when  reading  gels  the
  codes  shown  below can  be  used. Use  of these codes permits us to
  extract the maximum amount of data from each gel and yet record  any
  doubts   by  choice   of   code.    The program can deal with all of
  these codes and any other  characters  in  a  sequence  are  treated
  as  dash  (-) characters.

         SYMBOL                  MEANING

           1             PROBABLY        C
           2                "            T
           3                "            A
           4                "            G
           D                "            C       POSSIBLY        CC
           V                "            T          "            TT
           B                "            A          "            AA
           H                "            G          "            GG
           K                "            C          "            C-
           L                "            T          "            T-
           M                "            A          "            A-
           N                "            G          "            G-
           R             A OR G
           Y             C OR T
           5             A OR C
           6             G OR T
           7             A OR T
           8             G OR C
           -             A OR G OR C OR T
           a             A set by auto edit
           c             C set by auto edit
           g             G set by auto edit
           t             T set by auto edit
           *             padding character placed by auto assembler
            else = -

  The DNA consensus algorithm

        The "calculate  consensus"  function,  the  "display   contig"
  routine and the "show quality" option use  the rules  outlined  here
  to  calculate  a consensus  from aligned gel  readings.   Note  that
  "display  contig"  calculates a consensus for  each  page  width  it
  displays  (it  does  not use the consensus sequence file  calculated
  by the consensus function).

        We  have  6  possible  symbols  in  the  consensus   sequence:
  A,C,G,T,*  and -. The last symbol is assigned if none of the  others
  makes up a sufficient proportion of the aligned  characters  at  any
  position  in the contig. The following calculation is used to decide
  which symbol to place in the consensus at each position.

        Each uncertainty code contributes a score to one of  A,C,G,T,*
  and  also  to  the  total  at each point. Symbols like R and Y which
  don't correspond to a single base type contribute only to the  total
  at each point. The scores are shown below.
        definite assignments, ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* = 1
        probable assignments, ie 1,2,3,4                           = 0.75
        other uncertainty codes, including R,Y,5,6,7,8,-           = 0.1

        A cutoff score of 51% to 100% is supplied by the  user.  (When
  the   program   starts   this  is  set  to  75%.  See  "set  display
  parameters").  At each position in the contig we calculate the total
  score  for  each of the 5 symbols A,C,G,T and * (denote these by Xi,
  where i=A,C,G,T or *), and also the sum of these totals (denote this
  by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
  in the consensus; otherwise - is assigned.

        Notice that S does not equal the number of times the  sequence
  has  been  determined, but is the score total, and hence we are less
  likely to put a -  in  the  consensus.  For  the  "examine  quality"
  algorithm  each  strand is treated separately but the calculation is
  the same. (It was originally different).
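
        The calculation for one position can be sketched as below.
  This is an illustration written from the description above and is
  not the program's own code. Two points are interpretations: codes
  such as R and Y are taken to add 0.1 to S but to no Xi, and the
  comparison with the cutoff is strict, as the text is written. The
  column passed in should hold only the characters of the gel readings
  that cover the position.

```python
# Base (or pad) that each definite symbol scores for; each scores 1.
DEFINITE = {"A": "A", "C": "C", "G": "G", "T": "T", "*": "*",
            "a": "A", "c": "C", "g": "G", "t": "T",
            "D": "C", "V": "T", "B": "A", "H": "G",
            "K": "C", "L": "T", "M": "A", "N": "G"}
# Base that each "probably" symbol scores for; each scores 0.75.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    scores = dict.fromkeys("ACGT*", 0.0)  # the Xi of the text
    total = 0.0                           # the S of the text
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            total += 0.1  # R, Y, 5 to 8, "-" and any other character
    for symbol, score in scores.items():
        if total > 0 and 100.0 * score / total > cutoff:
            return symbol
    return "-"
```

  For example, a column holding "3" and "A" gives XA = 1.75 and
  S = 1.75, so A is placed in the consensus, while a column holding
  "C" and "G" gives 50% for each and so "-" is assigned.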

        Format of the consensus sequence ( and vector sequences).

        A consensus  sequence  file  may  contain  the  consensus  for
  several contigs and so we identify each of them by preceding them by
  a 20 character title. The title is of the form  <---LAMBDA.076----->
  (  where LAMBDA is the project name and gel reading number 76 is the
  leftmost gel reading to contribute to  this   consensus   sequence).
  The angle brackets <> and the three digit number preceded by a  full
  stop (.) are important to some processing programs.
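
        A program reading such a file could pick a title apart as
  sketched below. The function name and the pattern are hypothetical.
  The pattern assumes that the project name consists of letters only
  and that the rest of the 20 characters are dashes, as in the example
  title above.

```python
import re

# "<", dashes, project name, ".", three digits, dashes, ">".
TITLE = re.compile(r"<-*([A-Za-z]+)\.(\d{3})-*>")


def parse_title(title):
    """Return (project name, leftmost gel reading number) from a title."""
    if len(title) != 20:
        raise ValueError("a title is 20 characters long")
    match = TITLE.fullmatch(title)
    if match is None:
        raise ValueError("not a consensus title")
    return match.group(1), int(match.group(2))
```

  Applied to <---LAMBDA.076-----> this gives the project name LAMBDA
  and the gel reading number 76.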
 @25. TX 1 @Show relationships

        Used to show the relationships of  the  gel  readings  in  the
  database in three ways -
  (a) All contig descriptor lines  followed  by  all  gel   descriptor
  lines.
  (b) All contigs one after the   other   sorted,   i.e.    for   each
  contig   show  its   contig  descriptor line followed by all its gel
  descriptor lines sorted on position from left to right
  (c) Selected contigs:  show the contig  line  and,  in  order, those
  gel  readings  that  cover  a  user-defined  region.  Note that this
  output can be directed to a disk file by prior  selection  of  "disk
  output".

        Below is an example showing a contig from position 1  to  689.
  The left gel reading  is number 6 and has archive name HINW.010, the
  rightmost gel reading is number 2 and has archive name HINW.004.
  On  each  gel  descriptor  line  is  shown:  the name of the archive
  version, the gel number, the position of the left  end  of  the  gel
  reading  relative to the left  end  of  the  contig,  the length  of
  the gel reading  (if this is negative it means that the gel  reading
  is  in  the  opposite orientation to its archive), the number of the
  gel reading   to the left and the number of the gel reading  to  the
  right.


   CONTIG LINES
   CONTIG      LINE  LENGTH               ENDS
                                       LEFT   RIGHT
                 48     689               6       2
   GEL LINES
   NAME      NUMBER POSITION LENGTH     NEIGHBOURS
                                       LEFT   RIGHT
   HINW.010       6        1   -279       0       3
   HINW.007       3       91   -265       6       5
   HINW.009       5      137   -299       3      17
   HINW.999      17      140    273       5      12
   HINW.017      12      193    265      17      18
   HINW.031      18      385   -245      12       2
   HINW.004       2      401   -289      18       0
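
        The gel lines above form a chain linked by their left and
  right neighbours. The Python sketch below reads the example lines
  and walks the chain from the left end. It is an illustration, not
  part of the program: the names are hypothetical, a neighbour of 0 is
  taken to mean "no neighbour" (as the example suggests), and the
  contig length formula is inferred from the example figures.

```python
from collections import namedtuple

GelLine = namedtuple("GelLine", "name number position length left right")

EXAMPLE_GEL_LINES = """\
HINW.010       6        1   -279       0       3
HINW.007       3       91   -265       6       5
HINW.009       5      137   -299       3      17
HINW.999      17      140    273       5      12
HINW.017      12      193    265      17      18
HINW.031      18      385   -245      12       2
HINW.004       2      401   -289      18       0
"""

def parse_gel_line(line):
    """Split one gel descriptor line into its six fields."""
    fields = line.split()
    return GelLine(fields[0], *(int(f) for f in fields[1:]))

def ordered_gels(gels, left_end):
    """Follow right neighbours from the left end gel reading."""
    by_number = {gel.number: gel for gel in gels}
    order, seen, current = [], set(), left_end
    while current != 0:
        if current in seen:
            raise ValueError("loop in contig at gel %d" % current)
        seen.add(current)
        order.append(by_number[current])
        current = by_number[current].right
    return order

def contig_length(gels):
    """Rightmost base covered; a negative length only marks the sense."""
    return max(gel.position + abs(gel.length) - 1 for gel in gels)

gels = [parse_gel_line(line) for line in EXAMPLE_GEL_LINES.splitlines()]
```

  With the example data, following right neighbours from gel 6 visits
  the gels in the order listed, and the computed length agrees with
  the 689 shown on the contig line.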

 @21.  TX 3 @Enter new gel reading

        THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE AUTO ASSEMBLE

        Used to enter new gel readings into the database. The new  gel
  reading  must have previously been compared with the contents of the
  database by use of " auto assemble"  in order  to  ascertain  if  it
  overlaps any previously entered data.

        The user is expected to know: if the gel reading overlaps;  if
  so  which  contig  it overlaps; if so where it overlaps. The program
  takes the user through a series of questions to establish the nature
  of  the  overlap  and  then  displays  the overlap. The user is then
  offered a number of options,  including  editors  for  the  new  gel
  reading  and  the contig, to enable the correct alignment of the gel
  reading throughout its whole length.
  Supply the name of the gel reading file.  If the  gel  reading   has
  been  entered before the program will  not permit entry. The program
  gives the gel reading a unique  number  and  asks  if  the  sequence
  overlaps  any  data  already  in  the  database  (reported  by "auto
  assemble").  If it does not, entry is complete.  If it does  overlap
  the dialogue continues with the program asking if the gel reading
  overlaps "in the normal sense"; if not, it will automatically
  complement  the  sequence.  Then supply the number of the contig the
  gel reading overlaps (as reported by "auto assemble").

        Overlaps are divided into two types: those for which  the  new
  gel  reading  protrudes from the left end of the contig it overlaps,
  and those for which it does not. The program asks  about  this  with
  the  question "Left end of gel reading is inside contig". If this is
  true the program will go on to ask for the position in the contig of
  the  left  end of the new gel reading. If it is not true the program
  will ask for the position in the new gel reading of the left end  of
  the contig.

        Once this is completed the program will display the  first  50
  bases  of  the  overlap.  The  gel  readings in the contig and their
  consensus are displayed with the new  gel  reading  underneath.  The
  mismatches are shown by *'s on the next line down. For example:


                             60        70        80        90       100
     -6  HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
     -3  HINW.007                                            GGCACA*GTC
         CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC
         NEWGEL      CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC
         MISMATCH                  *                         * *
                             10        20        30        40        50


        The program then needs to know if the position of the left
  end of the overlap is correct. If it is, the user should type
  return; if not, the user should type 1 and the program will ask for
  the new position and display it.
  The program now offers a number of  options  to  allow  the user  to
  align  the  new gel reading correctly over its whole length with the
  data   already   in   the   contig.    It   is    important     that
  sufficient   edits   are   made   to  the  new  gel  reading  or the
  sequences in the contig at this stage to get the alignment  correct,
  because  once entry  is completed, the alignment is fixed and cannot
  easily be changed (see "alter relationships").  Alignment   can   be
  achieved  by   making  insertions   or  deletions  but  deletion  of
  data requires the original gels to be checked.   For   this   reason
  at  entry  we usually make only insertions to achieve alignment.  We
  use X or asterisks (*) as padding characters  to  achieve  alignment
  and  so   can,  if  required, distinguish  padding  characters  from
  characters assigned from reading gels.

        The options available are:
     ? = HELP
     ! = Give up
     3 = Complete entry
     4 = Edit contig
     5 = Display overlap
     6 = Edit new gel reading



        1. HELP gives this information.

        2. Give up allows users to change their minds  about  entering
  the  new  gel reading. The program will ask the user to confirm this
  choice.

        3. Complete entry is the command to add the new gel reading to
  the  contig.  The program updates the relationships accordingly. The
  user is asked to confirm this command.

        4. Edit contig gives the user access to a simple  editor  that
  allows  insertions,  deletions and changes to be made to the contig.
  The editor  maintains  alignments  by  making  the  same  number  of
  insertions or deletions in all sequences covering the edit position.
  The program protects the user by allowing edits only within the
  region of overlap.

        5. Display allows display of the region of overlap only.  This
  is  defined  by the relative positions in the contig. The default is
  the whole of the region of overlap.

        6. Edit new gel reading allows  the  new  gel  reading  to  be
  edited using a simple editor.
 @23. TX 3 @         Complement a contig

        This function will complement  and  reverse  all  of  the  gel
  readings   in    a  contig.     It    automatically   reverses   and
  complements  each  gel reading sequence,  reorders  left  and  right
  neighbours,   recalculates   relative  positions  and  changes  each
  strandedness.
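
        The bookkeeping described above can be sketched in Python as
  follows. This is an illustration under stated assumptions, not the
  program's own code: only the characters A,C,G,T (either case) are
  complemented and all other codes, including padding, are left as
  they are; the position formula is inferred from the layout of the
  gel descriptor lines; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Complement A,C,G,T and reverse; other characters are kept."""
    return sequence.translate(_COMPLEMENT)[::-1]

def complement_gel(position, length, left, right, contig_length):
    """Return the new (position, length, left, right) for one gel line.

    The old right end of the reading becomes its new left end, the
    sign of the length (the strandedness) flips, and the left and
    right neighbours change places.
    """
    old_right_end = position + abs(length) - 1
    new_position = contig_length - old_right_end + 1
    return new_position, -length, right, left
```

  For a contig of length 689, a reading at position 401 with length
  -289 ends at base 689, so after complementing it starts at position
  1 with length 289.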

        The only user  input  required  is  to  identify  the   contig
  to complement  by  the  number or name of a gel reading it contains.
  DO NOT KILL THE PROGRAM DURING THIS STEP!
 @22. TX 3 @          Join contigs

        This function joins contigs interactively using a mouse driven
  editor.   The operation of this editor is very similar to the Contig
  Editor described in "@4 Edit".

        It allows the user  to align the ends of the  two  contigs  by
  editing  each contig separately.  It is important that the alignment
  achieved   is  correct  because  once  the  join  is  completed  the
  alignment is fixed.

        First identify the two contigs to be joined: the left contig
  and then the right. The program checks that the two contig numbers
  are different (it will not allow circles to be formed!).

        The Join Editor consists of  two  Contig  Editors  in  between
  which  is sandwiched a disagreement box. This disagreement box shows
  exclamation marks to denote mismatches between the two consensuses.

        For example, the display will look something like this:

                           1460      1470      1480      1490      1500
     56  HINW.100    TCT*GAGCAGTGTGGGCGCTG*CCGG
     33  HINW.300    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
    -25  HINW.090    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
     19  HINW.123    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
         MISMATCH                         !                      !!!!!!
                             10        20        30        40        50
     -6  HINW.010    TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
     -3  HINW.007                TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
     -5  HINW.009                              GCTCGGAGGGCATGAAGT*AGAGC
         CONSENSUS   TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC



        The best  strategy  for  joining  is  to  identify  the  exact
  position  of  overlap.  This  is defined as the position in the left
  contig that the leftmost character of  the  right  contig  overlaps.
  The  overlap  must be of at least one character.  Use the scroll bar
  and the scroll buttons (`<<',`<',`>',and`>>')  for  positioning  the
  relative positions of the two contigs.

        The join position can be fixed in  position  by  pressing  the
  `lock' button at the top of the Join Editor.  Locking allows the two
  contigs to be scrolled as one when using the scroll bar and buttons,
  the left ends always in the same position relative to each other.

        Once locked, it is best to proceed  to  the  right  along  the
  contigs,  inserting padding characters (`*') into the consensuses to
  minimise the disagreements.

        It  is  essential  that  the  user  aligns  the  two   contigs
  throughout  the  whole  region of overlap before completing the join
  because it is only at this stage that the two contigs can be  edited
  independently.  Once the join is completed the alignment can only be
  altered using the routines supplied by "alter relationships".

        The join can be  completed  by  pressing  the  `Leave  Editor'
  button.  The  percentage  mismatch  is  displayed,  and  the user is
  required to confirm that they want to perform the join.
 @24. TX 1 @               Copy the database

        Used to make a copy of the database. If required the  database
  size  can  be altered using this option. The "version" of a database
  is  encoded as the last letter in the names of the five  files  that
  contain the database.

        Supply a "version" number (the default is version 1),  and  if
  required  select a new size for the database. The size of a database
  is the number of lines of information it can hold. It needs  a  line
  for each gel reading and another for each contig.
 @19. TX 1 @               Check database

        Used to perform a check on  the  logical  consistency  of  the
  database. No user intervention is required.

        The following relationships are checked:
  1.       If gel reading A thinks gel reading B is its left neighbour
  does B think A  is its right neighbour?  The error message is
  "Hand holding problem for gel reading A"
  followed by  the gel descriptor lines for gel readings A and B.
  2.       Are there any contig lines with no left or  right  end  gel
  readings?  The error message is
  "Bad contig line number A"
  3.       Do the gel readings that are  described  as  left  ends  on
  contig lines agree that they are left ends?  The error message is
  "The end gel readings of contig A have outward neighbours"
  4.       Are there gel readings that are in more  than  one  contig?
  The error message is
  " Gel number A is used N times"
  5.       Are there gel readings that are not  in  any  contig?   The
  error message is
  " Gel number A is not used"
  6.       Do the relative positions of   gel  readings   agree   with
  their  position  as  defined by left and right neighbourliness?  The
  error message is
  " Gel number A with position X is left neighbour of  gel  number   B
  with position Y"
  7.       Are there any loops in   contigs?    If   so   no   further
  checking is done.  The error message is
  " Loop in contig n no further checking done, but gel reading numbers
  follow"
  The program  then  prints the gel  reading  numbers  in  the  looped
  contig up to the start of the loop.
  8. Are there any contigs of length <1? The error message is
  " The contig on line number x has zero length"
  9. Are there any gel readings (used in only one  contig)  that  have
  zero length? The error message is
  " Gel number N has zero length"
  Note that "auto assemble"  also uses this logical consistency  check
  and will only tolerate a "Gel number N is not used" error. Any other
  error will cause it to give up.
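
        Three of the checks listed above (1, 6 and 7) can be sketched
  in Python as follows. This is an illustration, not the program's own
  code. The data layout (a dictionary from gel number to its position
  and neighbours, with 0 meaning "no neighbour") and the function
  names are assumptions; the message texts are taken from the list
  above.

```python
def check_hand_holding(gels):
    """Check 1: does each left neighbour point back with its right?"""
    errors = []
    for a, gel in sorted(gels.items()):
        b = gel["left"]
        if b != 0 and (b not in gels or gels[b]["right"] != a):
            errors.append("Hand holding problem for gel reading %d" % a)
    return errors

def check_positions(gels):
    """Check 6: a left neighbour must not start to the right."""
    errors = []
    for b, gel in sorted(gels.items()):
        a = gel["left"]
        if a in gels and gels[a]["position"] > gel["position"]:
            errors.append(
                "Gel number %d with position %d is left neighbour of "
                "gel number %d with position %d"
                % (a, gels[a]["position"], b, gel["position"]))
    return errors

def find_loop(gels, left_end):
    """Check 7: follow right neighbours from the left end.

    Returns None if the chain ends normally, otherwise the gel
    numbers visited before one of them was reached a second time.
    """
    seen = []
    current = left_end
    while current != 0:
        if current in seen:
            return seen
        seen.append(current)
        current = gels[current]["right"]
    return None
```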
 @29. TX 1 @               Examine quality

        Analyses the quality of the data in a contig.  It  reports  on
  the  proportion  of the consensus that is "well determined" and will
  display a sequence of symbols  that  indicate  the  quality  of  the
  consensus at each position.

        Identify the contig to analyse, and the section  of  interest.
  The  current  consensus  calculation  cutoff  score  will be used to
  decide if each position is "well determined". In general the quality
  of  a  reading deteriorates along the length of the gel and so it is
  also possible to use a length cutoff for  the  quality  calculation.
  Only  the  data  from  the  first  section  of  each reading will be
  included in the quality calculation. The  length  is  altered  under
  "set parameters" and is initially set to the maximum reading length.
  A summary showing the percentage of the consensus  that  falls  into
  each category of quality is shown. Choose whether or not to have the
  quality codes for each position of the consensus displayed. They can
  be displayed as either graphics or text.

        The quality of the data depends on the number of times it  has
  been  sequenced  and the particular uncertainty codes  used  in each
  gel reading.  This function divides the data into  five  categories,
  assigning each a symbol or code:
  1.  Well determined on both strands and they agree.  code=0
  2.  Well determined on the plus strand only.  code=1
  3.  Well determined on the minus strand only.  code=2
  4.  Not well determined on either strand.  code=3
  5.  Well determined on both strands but they disagree.  code=4
  A position is "well determined" if it is assigned one of the symbols
  A,C,G,T by the algorithm described in the section "calculate a
  consensus". The calculation is performed separately for each
  strand.
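
        The five categories can be expressed as a small Python
  function. This is a sketch only: the function name is hypothetical,
  and it assumes that the consensus symbol for each strand has already
  been calculated and that both are given in the same orientation.

```python
WELL_DETERMINED = frozenset("ACGT")

def quality_code(plus_symbol, minus_symbol):
    """Return the quality code 0-4 for one consensus position."""
    plus_ok = plus_symbol in WELL_DETERMINED
    minus_ok = minus_symbol in WELL_DETERMINED
    if plus_ok and minus_ok:
        # Both strands well determined: agree (0) or disagree (4).
        return 0 if plus_symbol == minus_symbol else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3
```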

        If the user chooses to have the data displayed graphically the
  following  scheme  is used. A rectangular box is drawn so that the x
  coordinate  represents  the  length  of  the  contig.  The  box   is
  notionally divided vertically into 5 possible levels which are given
  the y values: -2,-1,0,1,2.  The quality  codes  attributed  to  each
  base  position are plotted as rectangles.  Each rectangle represents
  a region in which the quality codes are identical, so a single  base
  having a different code from its immediate neighbours will appear as
  a very narrow rectangle.

    Rectangle bottom and top y values

       Quality 0 rectangle from 0 to 0
       Quality 1 rectangle from 0 to 1
       Quality 2 rectangle from 0 to -1
       Quality 3 rectangle from -1 to 1
       Quality 4 rectangle from -2 to 2

        Obviously a single line  at  the  midheight  shows  a  perfect
  sequence.
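
        The grouping of identical neighbouring codes into rectangles
  can be sketched in Python as follows. This is an illustration, not
  the program's plotting code; the function name and the returned
  tuple layout are assumptions, and the y values are copied from the
  table above.

```python
from itertools import groupby

# Quality code -> (y at one edge, y at the other), as in the table.
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}

def quality_rectangles(codes, first_position=1):
    """Turn a sequence of quality codes into rectangles.

    Each rectangle is (first base, last base, y from, y to) and covers
    one run of identical codes.
    """
    rectangles = []
    position = first_position
    for code, run in groupby(codes):
        run_length = len(list(run))
        y_from, y_to = Y_RANGE[code]
        rectangles.append(
            (position, position + run_length - 1, y_from, y_to))
        position += run_length
    return rectangles
```

  A single base with a different code from its neighbours gives a
  rectangle whose first and last base are the same, which is why it
  appears as a very narrow rectangle on the plot.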

        Typical dialogue is shown below.

     41.47% OK on both strands and they agree(0)
     55.48% OK on plus strand only(1)
      2.08% OK on minus strand only(2)
      0.97% Bad on both strands(3)
      0.00% OK on both strands but they disagree(4)
    ? (y/n) (y) Show sequence of codes

             10         20         30         40         50
     1111111111 1111111111 1111111111 1111111111 1111111111

             60         70         80         90        100
     1111111111 1111111111 1111111111 3111111111 1111111111

            110        120        130        140        150
     1111111111 1111131111 1111111111 1111111111 1111111111

            160        170        180        190        200
     1111111111 1111111111 1111111111 1111111111 1111111133

            210        220        230        240        250
     1311111111 1111111111 1111111110 0000000000 0000220000

            260        270        280        290        300
     0000000000 0020000000 2200000202 0002000000 0000222200

 @26. TX 3 @              Alter relationships

        Used  to  make  what  are  normally  illegal  changes  to  the
  database. That is the normal checks are not done and any item in the
  database can be changed independently of all others. Users  need  to
  know  what they are doing because it is very easy to make a horrible
  mess. Always start by making a copy!

        By using the  options  here  users  can  edit  individual  gel
  readings  in  contigs,  move  one  section  of  a contig relative to
  another, break contigs, remove contigs, remove gel readings, etc. To
  give  flexibility most of the commands do only one thing. This means
  that several commands may  have  to  be  executed  to  complete  any
  change.  At the end of this help section there are notes on removing
  gel readings from the database.

        The following options are offered:

     Cancel
     Line change
     Edit single gel reading
     Delete contig
     Shift
     Move gel reading
     Rename gel reading
     Break a contig
     Alter raw data parameters

  1. Cancel returns to the main options of SAP.
  2. Line change
  allows the user to change the contents  of  any line in the file  of
  relationships.   The  line is selected by number, the program prints
  the current line and prompts for the new  line.
  3.   Edit
  allows   the   user   to    edit    an    individual    gel  reading
  independently of any others it may be related to. The edit positions
  are relative to the contig. The effect of this editing on the length
  of the gel reading is taken care of but, if it changes the length of
  a contig, or its relationship to others, this must be accounted  for
  (if necessary) by use of the "line change" function.
  4.  Delete  contig
  is a function that deletes a contig line  by moving  down  all   the
  contig lines above by one position.  It prompts only for the line to
  delete.  It does not  delete  any   of   the  gel  readings  or  gel
  reading  lines  for the deleted contig but it does reduce the number
  of contigs on line IDBSIZ by 1.
  5.  Shift
  allows the user to change all the relative  positions of  a set   of
  neighbouring  gel  readings by some fixed value, i.e.  it will shift
  related gel readings either left or  right.   It  can  therefore  be
  used  to  change the alignment of the gel readings in a contig or as
  part of the process of breaking a contig into two parts (see below).
  It  prompts  for  the  number  of the first gel reading to shift and
  then  for the  distance  to  move  them (Note a negative value  will
  move  the  gel readings left and a positive value right).   It  then
  chains rightwards (ie follows right neighbours) and shifts each  gel
  reading,  in  turn,  up to the  end of the contig.  (This means that
  only those gel readings from the first to shift to the rightmost are
  moved). It updates the length of the contig accordingly.
  6. Move gel reading
  is  a  function  to  renumber  a  gel  reading.  It  moves  all  the
  information  about  a  gel reading on to another line. The user must
  specify the number of the gel  reading to move and the number of the
  line  to place it. It takes care of all the relationships. Of course
  gel readings must not be  moved  to  lines  occupied  by  other  gel
  readings!  It  can  be used as part of the process of removing a gel
  reading from the database (see below).
  7.  Rename gel reading
  is a function that is used to  rename  the archive   names   of  gel
  readings   in  the  database;   it only changes the name in the .ARN
  file of the  database.

  8. Break contig

        Occasionally it is necessary to break a contig into two  parts
  and  this  can be achieved using this option. The program needs only
  the number of a gel reading. This is  the  gel   reading  that  will
  become  a  left  end  after  the  break.  That is, the break is made
  between this gel reading and its left neighbour. A new  contig  line
  is created so ensure that there is sufficient space in the database.
  Removing gel readings from contigs

        Gel readings can be removed  from  contigs  if  they  are  not
  essential  for  holding the contig together (ie are not the only gel
  reading covering a particular region). Suppose the  gel  reading  to
  remove  is gel number b with left neighbour a and right neighbour c.
  Using "line change" change the right neighbour of a to  c,  and  the
  left neighbour of c to a. To tidy things up: suppose there are x gel
  readings in the database; then, using "move gel reading" move gel  x
  to  line  b;  then,  using  "line change" decrease the number of gel
  readings in the database (stored in the last line) by 1.

  9. Alter raw data parameters

        Allows the user to edit the individual  raw  data  parameters,
  such  as  the  left  and  right  cutoff  lengths and the name of the
  machine readable trace file.  The user must specify the gel line  to
  modify,  and  provide  new values for the length of the raw sequence
  including cutoff lengths, the left cutoff position,  the  length  of
  the original working sequence, the machine type, and the name of the
  raw data file, where these values change.
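
        The steps given in the notes on removing gel readings from
  contigs can be sketched in Python as follows. This is an
  illustration of the bookkeeping only, not the program's own code.
  The dictionary layout (gel number to its left and right neighbours,
  0 meaning "no neighbour") and the function name are assumptions;
  contig lines, positions and sequences are not modelled, and no check
  is made that the reading is safe to remove.

```python
def remove_gel(gels, b):
    """Unlink gel b, move the last gel onto line b, return new count.

    gels maps gel number -> {"left": n, "right": n} and is changed in
    place.
    """
    a, c = gels[b]["left"], gels[b]["right"]
    # "line change": join the neighbours of b to each other.
    if a != 0:
        gels[a]["right"] = c
    if c != 0:
        gels[c]["left"] = a
    # "move gel reading": put the highest numbered gel x on line b.
    x = max(gels)
    if x != b:
        gels[b] = gels.pop(x)
        for gel in gels.values():
            if gel["left"] == x:
                gel["left"] = b
            if gel["right"] == x:
                gel["right"] = b
    else:
        del gels[b]
    # "line change": the number of gel readings is now one fewer.
    return len(gels)
```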
 @27. TX 1 @  Set display parameters

        Used to  redefine  the  parameters  that  control  the  cutoff
  employed  by  the  consensus  calculation  and quality examiner, the
  maximum  length  of  each  reading  to  include   in   the   quality
  calculation,  the line length used by the display function, the text
  window length used by the graphics options, and the graphics  window
  length used by the graphics options.

        The default cutoff score is 75%. The default line length is 50
  characters. For protein sequences the cutoff is always 100%.

        The text window used by  the  graphics  options  controls  the
  amount  of  sequence  listed at the crosshair position. The graphics
  window controls the "zoom" function. Both these windows are  defined
  as  the number of bases that should be shown, to both left and right
  of the crosshair.
 @30. TX 3 @  Auto edit a contig

        This function automatically changes characters in gel readings
  to  make  them  agree with the consensus sequence. If employed as is
  intended, use of this function is not  a  criminal  activity  but  a
  method  that saves a large amount of work. All characters changed by
  the auto editor  will  appear  in  the  gel  readings  as  lowercase
  letters. The current consensus calculation cutoff score is used.

        Identify the contig and the section to edit. The program  will
  display  a  summary  of  changes  made. Note that it is important to
  understand both what the auto editor does and the order in which  it
  does  it. Before employing the auto editor users should note all the
  corrections that they require, so that  after it has been  used  the
  corrections can be checked.

        The general strategy employed when collecting shotgun sequence
  data  is  to let the contigs get fairly deep, to get a printout of a
  contig, check problems against the films, note  corrections  on  the
  printout,  and  make  the  changes  using  an interactive editor. In
  general the consensus is correct except  for  places  where  padding
  characters  have been used to accommodate a single gel with an extra
  character, or where the consensus is dash. The important  point  for
  the  auto  editor  is  that  most edits simply make the gel readings
  conform to the consensus, or remove columns of pads.

        The new editor does the following.

        1) calculates a consensus for the contig (or part of a contig)
  to  be edited, and then uses this consensus to direct the editing of
  the contig in 3 stages

        2) stage 1: find and correct all places where, if the order of
  two  adjacent  characters  is swapped, they will both agree with the
  consensus (given that they did  not  match  the  consensus  before).
  These corrections are termed "transpositions"

        3)  stage 2: find and correct all  places  where  there  is  a
  definite  consensus  but  the gel reading has a different character.
  These corrections are termed "changes".

        4) stage 3: delete all  positions  in  which  padding  is  the
  consensus. These corrections are termed "deletions".

        All changed characters are shown in lowercase  letters  so  it
  will  be  obvious which characters have been assigned by the program
  (except for deletions). The number of each type of  correction  will
  be displayed.
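
        The three stages can be sketched in Python for a single gel
  reading as follows. This is an illustration under stated
  assumptions, not the program's own code: the reading is already
  aligned to the consensus character for character, the consensus is
  calculated once and not revised between stages, padding is written
  as * and a "definite consensus" means one of A,C,G,T. The function
  name is hypothetical.

```python
def auto_edit(reading, consensus):
    """Return (edited reading, counts of each type of correction)."""
    if len(reading) != len(consensus):
        raise ValueError("reading and consensus must be aligned")
    chars = list(reading)
    counts = {"transpositions": 0, "changes": 0, "deletions": 0}

    # Stage 1: swap adjacent characters that both disagree with the
    # consensus now but would both agree after the swap.
    i = 0
    while i < len(chars) - 1:
        a, b = chars[i].upper(), chars[i + 1].upper()
        c1, c2 = consensus[i], consensus[i + 1]
        if a != c1 and b != c2 and a == c2 and b == c1:
            chars[i], chars[i + 1] = b.lower(), a.lower()
            counts["transpositions"] += 1
            i += 2
        else:
            i += 1

    # Stage 2: where the consensus is definite and the reading
    # differs, write the consensus character in lowercase.
    for i, c in enumerate(consensus):
        if c in "ACGT" and chars[i].upper() != c:
            chars[i] = c.lower()
            counts["changes"] += 1

    # Stage 3: delete positions at which the consensus is padding.
    kept = [ch for ch, c in zip(chars, consensus) if c != "*"]
    counts["deletions"] = len(chars) - len(kept)
    return "".join(kept), counts
```

  For example, the reading ACTGA*T against the consensus ACGTA*T has
  one transposition (TG becomes gt) and one deletion (the padded
  column), giving ACgtAT.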
 @10. TX 2 @Clear graphics

        Clears graphics from the screen.
 @11. TX 2 @Clear text

        Clears  text from the screen.
 @12. TX 2 @Draw a ruler.

        This option allows the user to draw a ruler or scale along the
  x  axis  of the screen to help identify the coordinates of points of
  interest. The user can define the position of the first base  to  be
  marked  (for  example if the active region is 1501 to 8000, the user
  might wish to mark every 1000th base starting at either 1501 or 2000
  -  it  depends  if  the user wishes to treat the active region as an
  independent unit with its own numbering starting at its  left  edge,
  or  as  part  of  the  whole sequence). The user can also define the
  separation of the ticks on the scale and their height.  If  required
  the labelling routine can be used to add numbers to the ticks.
 @14. TX 2 @Reposition plots

        The position of each plot is defined relative to a user's
  drawing board, which has size 1-10,000 in x and 1-10,000 in y.
  Plots for each option are drawn in a window defined by x0,y0 and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the window, xlength is the width of the window and
  ylength is the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "ANALMARG" when
  the program is started. Users can have their own file  if  required.
  As  all  the plots start at the same position in x and have the same
  width, x0 and xlength are the same for all options. Generally  users
  will  only  want  to change the start level of the window y0 and its
  height ylength. This option allows users to change window  positions
  whilst  running  the  program.   The  routine  prompts first for the
  number of the option that the user wishes to reposition;  then  for
  the  y  start and height; then for the x start and length. Note that
  changes to the x values affect all options. If the user  types  only
  carriage  return  for any value it will remain unchanged. Note that,
  unlike all the other programs, the boxes used to contain  analytical
  results (eg plot quality) should not be made to overlap one another,
  as the function of the crosshair routine depends on  which  box  the
  crosshair is in!
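
        Before changing margins it can be useful to check that a set
  of windows stays on the drawing board and that no two boxes overlap.
  The Python sketch below shows one way to do this. It is not part of
  the program: the names and the window values used in the usage note
  are hypothetical, and windows that only touch along an edge are
  treated as not overlapping.

```python
from collections import namedtuple

BOARD_MIN, BOARD_MAX = 1, 10000

PlotWindow = namedtuple("PlotWindow", "name x0 y0 xlength ylength")

def fits_board(window):
    """True if the window lies within the 1-10,000 drawing board."""
    return (window.xlength > 0 and window.ylength > 0
            and window.x0 >= BOARD_MIN and window.y0 >= BOARD_MIN
            and window.x0 + window.xlength <= BOARD_MAX
            and window.y0 + window.ylength <= BOARD_MAX)

def overlapping_pairs(windows):
    """Return the names of every pair of windows that overlap."""
    pairs = []
    for i, a in enumerate(windows):
        for b in windows[i + 1:]:
            x_overlap = (a.x0 < b.x0 + b.xlength
                         and b.x0 < a.x0 + a.xlength)
            y_overlap = (a.y0 < b.y0 + b.ylength
                         and b.y0 < a.y0 + a.ylength)
            if x_overlap and y_overlap:
                pairs.append((a.name, b.name))
    return pairs
```

  Because x0 and xlength are the same for all options, in practice
  only the y start and height decide whether two boxes overlap. For
  instance, a box with y0=1000 and ylength=3000 does not overlap one
  with y0=4500 and ylength=1000, but it does overlap one with y0=3500.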
 @15. TX 2 @Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written on the  diagram
  with  its  left end at the cross hair position.  The cross-hair will
  then immediately reappear. The  user  may  put  the  same  label  on
  another part of the diagram as before or if he hits the space bar he
  will be asked if he wishes to type in another label.

        Typical dialogue follows.
  ? Menu or option number=15
  Type label then drive cross hair to left or right end
  of label position then hit  "L"  to  write label left
  justified or  "R"  to  write label right justified or
  the space bar to quit


  ? Label=delta gene

   missing graphics

  ? Label=

 @16. TX 2 @Display a map

        This draws a map of any  sequence  features  selected  by  the
  user.   These  features  may  be  protein coding regions (CDS), tRNA
  genes (TRNA), promoter positions (PRM), etc. Users may define  their
  own  feature  table  key  names. For example I find it convenient to
  split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
  those  sequences  that  code in the reading frames 1, 2 or 3. Then I
  can plot them at different heights on the screen ( suitable  heights
  can be determined by using the cross-hair).  The coordinates must be
  stored in a file in the format of an EMBL feature table.

        Typical dialogue follows.
  ? Menu or option number=16
   Display a map using an EMBL feature table file
  ? map file name=hsegl1.ft
  ? feature code(e.g. CDS) =CDS
  X 1 + strand
    2 - strand
    3 both strands
  ? 0,1,2,3 =
  ? level (0-9480) (256) =4000

   missing graphics

  ? feature code(e.g. CDS) =

 @7. TX 1 @Redirect output

        Used to direct output that would normally appear on the screen
  to a file.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @13. TX 2 @Use crosshair

        This option puts a steerable cross on  the  screen  which  the
  user  drives  around  by  using  the arrow keys (or mouse). When the
  crosshair is visible a number of options are available if  the  user
  types  one  of  a  set  of  special  keyboard  characters. Any other
  characters will cause an exit from the crosshair option. The special
  keys are:

      I = Identify the nearest gel reading
      Z = Zoom in
      Q = plot Quality
      S = display the aligned Sequences at the crosshair position
      N = list the Names and Numbers of the sequences at the crosshair

        In order for  any  of  these  special  keys  to  operate,  the
  crosshair  must  be  in  an appropriate display box, and the precise
  function of the keys will also depend on which box the crosshair  is
  in.

        If the crosshair is in the "plot  all  contigs"  box,  Z  will
  cause  a  new box to appear showing all the readings for the nearest
  contig; Q will give the same as Z but will also produce an extra box
  showing the "quality" plot.

        If Z is hit in the "plot single contig" box, the  contig  will
  be  zoomed  to  the  current  graphics window size. The zoom will be
  roughly centred on the crosshair position. Because  of  this  it  is
  possible  to  step  along  a  contig  by repeatedly zooming with the
  crosshair near to one end of the single contig display box. If I  is
  hit  the crosshair must be close to a gel reading line. If Q is hit,
  the quality plot will be produced for the region shown in  the  plot
  single  contig  box. In all cases when the "plot all contigs" box is
  shown, a vertical line will bisect  the  line  that  represents  the
  relevant contig, at the current position.

        If the crosshair is in the plot quality box only the character
  "S" will operate as a special symbol.

        The number of bases shown in the N and S options is controlled
  by  the  current graphics text window size, and the size of the zoom
  window by the current graphics window size.  Both  are  set  by  the
  parameter setting function of the general menu.
 @33. TX 2 @Plot single contig

        This option produces a schematic of a  selected  region  of  a
  single  contig by drawing a horizontal line to represent each of its
  gel readings. The lines show the relative positions of each  reading
  and  also  their  sense.  The  plot  is  divided vertically into two
  sections by a line that is identified by an asterisk drawn  at  each
  end.  All lines that lie above this line represent readings that are
  in their original sense, all lines below show readings that  are  in
  the  complementary  sense to their original. By use of the crosshair
  function the plot can  be  stepped  through  and  examined  in  more
  detail. See help on crosshair.
 @34. TX 2 @Plot all contigs

        This option produces a schematic  of  all  the  contigs  in  a
  database.  It  does  this  by drawing a horizontal line to represent
  each of them. In order to show the ends of each contig it draws  the
  lines for contigs at alternate heights: the first at height one, the
  second at height two, the third at height one, etc. The order of the
  contigs  in  the display is the same as their order in the database.
  By use of the crosshair function the plot can be stepped through and
  examined in more detail. See help on crosshair.
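
        The alternating heights can be pictured with the small Python
  sketch below. It is only an illustration: the function name is
  hypothetical and the heights are the nominal values "one" and "two"
  used in the text, not the program's plotting units.

```python
def contig_heights(n_contigs):
    """Return the nominal plot height (1 or 2) for each contig, in
    database order: 1, 2, 1, 2, ..."""
    return [1 if index % 2 == 0 else 2 for index in range(n_contigs)]
```

        Because neighbouring contigs never share a height, the point
  where one line ends and the next begins is always visible.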
 @31. TX 3 @ Type in gel readings

        THIS OPTION IS NO LONGER AVAILABLE IN XDAP.

        This option  allows  gel  readings  to  be  typed  in  at  the
  keyboard. It creates a separate file for each gel reading and a file
  of file names for the batch. The sequences from each  batch  may  be
  listed  when  they have all been entered. Users may choose to employ
  special keys to identify the 4 bases A,C,G and T. By  default  these
  special  keys are N M , . but any other four characters may be used.
  If special keys are used the characters are automatically translated
  to A C G T before being stored on the disk.
 @35. TX 1 3 @Find internal joins

        The purpose of this function is to use  data  already  in  the
  database  to  find  possible  joins between contigs.  Joins may have
  been missed due to poor data or  may  have  not  been  made  due  to
  repeated  sequences.  Where  appropriate, it may be possible to find
  potential joins by using the data  clipped  off  readings  prior  to
  their entry into the database.

        The database is checked for logical consistency. Supply a
  minimum initial match length, a minimum alignment block, the maximum
  pads per sequence, the maximum percent mismatch after alignment, and
  the probe length. Choose whether clipped data is to be used; if so,
  define the window size for finding good data and the number of
  dashes allowed in the window. Processing will then commence. Most of
  these values are used in an identical way in the autoassemble
  function. The others are defined below.

  The program strategy

  Take the first contig and calculate its consensus. If  clipped  data
  is  being  used  examine  all readings that are in the complementary
  orientation, and sufficiently near to the contig's left end, to  see
  if  they have good clipped sequence which if present, would protrude
  from the left end of the contig.  If  found  add  the  longest  such
  sequence to the left end of the consensus. Do the same for the right
  end by examining readings that are in their original orientation. If
  any  are  found  add  the  longest extension to the right end of the
  consensus. Repeat the consensus calculations and extensions for  all
  contigs  hence  producing  an extended consensus. If clipped data is
  not  being  used  simply  calculate  the  consensus  for  the  whole
  database.  Now  look  for  possible joins by processing the extended
  consensus in the following way. Take the last, say 100, bases
  (termed the "probe length" by the program) of the rightmost
  consensus and compare it in both orientations with the extended
  consensus of all the other contigs. Display any sufficiently good
  alignments. Repeat with the left end of the rightmost contig. Do the
  same for the ends of all the extended contigs, always comparing only
  with the contigs to their left, so that the same matches do not
  appear twice.
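
        The order of comparisons in this search can be sketched in
  Python as follows. This is an illustration under stated assumptions,
  not the program's code: the function names are hypothetical, the
  consensus sequences are plain strings of A, C, G and T listed from
  left to right, and the alignment step itself is left out.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_comparisons(consensuses, probe_length=100):
    """Yield (probe_contig, end, sense, target_contig, probe_sequence).

    Contigs are indexed from left to right. Each end of each contig is
    used as a probe, in both orientations, but only against contigs to
    its left, so no pair of contigs is examined twice.
    """
    if probe_length <= 0:
        raise ValueError("probe_length must be positive")
    # Start with the rightmost contig and work leftwards.
    for i in range(len(consensuses) - 1, 0, -1):
        seq = consensuses[i]
        ends = (("right", seq[-probe_length:]), ("left", seq[:probe_length]))
        for end, probe in ends:
            orientations = (("+", probe), ("-", reverse_complement(probe)))
            for sense, oriented in orientations:
                for j in range(i):
                    yield i, end, sense, j, oriented
```

        Each tuple names one probe and one target; a real search would
  align the probe against the target's extended consensus and report
  only the alignments that pass the tests.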
  Good clipped data is defined by sliding a window of "Window size for
  good  data scan" bases outwards along the sequence and stopping when
  "Maximum number of dashes in scan window" or more dashes  appear  in
  the  window.   Note that it is advisable to have some sort of cutoff
  because if we simply take all the  data  it  might  be  so  full  of
  rubbish  that we won't find any good matches. For the same reason it
  is worth trying the procedure with different cutoffs. An initial run
  using  no  clipped  data  is  also  recommended.   Sufficiently good
  alignments are defined by  criteria  equivalent  to  those  used  in
  autoassemble; however, here we only display alignments that pass all
  tests.
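
        The window scan for good clipped data can be sketched in
  Python as shown below. This is a minimal illustration and not the
  program's own code. It assumes that the clipped sequence is given
  with its first character nearest the used part of the reading, and
  that the good data ends where the first failing window starts. The
  function and argument names are hypothetical.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Return how many leading characters of `clipped` count as good.

    A window of `window_size` characters slides outwards one base at a
    time. The scan stops at the first window that contains
    `max_dashes` or more dashes.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    last_start = max(len(clipped) - window_size, 0)
    for start in range(last_start + 1):
        window = clipped[start:start + window_size]
        if window.count("-") >= max_dashes:
            # Assumption: keep only the bases before the failing window.
            return start
    return len(clipped)
```

        For example, with a window of 4 and a limit of 2 dashes, the
  clipped sequence ACGTACGT--A-CGT gives 6 good bases (ACGTAC), since
  the window starting at the seventh base is the first to hold two
  dashes.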

  Bugs

  If a small contig is wholly contained within a larger one, such that
  its  ends  are further than ("Probe length" - "Minimum initial match
  length") from the ends of the larger contig, and the  consensus  for
  the small contig lies to the left of the consensus for the large
  contig, the overlap will not be discovered. (See the search
  strategy.)

  All numbering is relative to base number one in the contig:  matches
  to  the  left  (i.e.  in  the clipped data) have negative positions,
  matches off the right end of the contig (i.e. in the  clipped  data)
  have  positions  greater  than  that  of  the  contig  length.   The
  convention for reporting the positions of overlaps is as follows: if
  neither  contig needs to be complemented the positions are as shown.
  If the program says "contig x in the -  sense"  then  the  positions
  shown  assume  contig  x  has  been complemented. For example in the
  results given below the positions  for  the  first  overlap  are  as
  reported,  but  those  for  the second assume that the contig in the
  minus sense (i.e. 443) has been complemented.


   Possible join between contig   445 in the + sense and contig   405
   Percentage mismatch after alignment =  4.9
          412        422        432        442        452        462
       405  TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
             ********* * ********  ***** *** ********** ********** **********
       445  -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
         -127       -117       -107        -97        -87        -77
          472        482        492        502        512
       405  TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
            ********** ********** ********** ********** **
       445  TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
          -67        -57        -47        -37        -27
   Possible join between contig   443 in the - sense and contig   423
   Percentage mismatch after alignment = 10.4
           64         74         84         94        104        114
       423  ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
            **** ***** ********** ********** ******  ** ***** **** *********
       443  ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
         3610       3620       3630       3640       3650       3660
          124        134        144        154        164
       423  TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
            *** ****** ********** ** *******  *** ***** ** **
       443  TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
         3670       3680       3690       3700       3710


 @ end of help