\documentstyle[12pt]{article}
\title{A trace display and editing program for data from fluorescence based sequencing machines}
\author{Timothy Gleeson \and LaDeana Hillier}
\begin{document}
\maketitle
\section*{}
\subsection*{}
\subsubsection*{ABSTRACT}
``Ted'' ({\em T}race {\em ed}itor) is a graphical editor for sequence and trace data from automated fluorescence sequencing machines. It provides facilities for viewing sequence and trace data (in top or bottom strand orientation), for editing the base sequence, for automated or manual trimming of the head (vector) and tail (uncertain data) from the sequence, for vertical and horizontal trace scaling, for keeping a history of sequence editing, and for output of the edited sequence. Ted has been used extensively in the {\em C. elegans} genome sequencing project, both as a stand-alone program and integrated into the Staden sequence assembly package, and has greatly improved the efficiency and accuracy of sequence editing. It runs under the X Window System on Sun workstations and is available from the authors. Ted currently supports sequence and trace data from the ABI 373A and Pharmacia A.L.F. sequencers.

\subsubsection*{INTRODUCTION}
The time involved in sequence editing is extensive, and anything that eases that burden will improve the efficiency of any major sequencing project. Having sequence and trace data available online in an easily manipulable form is invaluable. Ted (a Trace-EDitor) was developed to fill this role in the {\em C. elegans} genome sequencing project [1].

\subsubsection*{METHODS}
{\em Computing Design and Implementation.} When designing ted, we had a number of specific computing goals in mind, including portability and adaptability. For portability, we chose to write ted in ANSI C using the X Window System and the Xaw toolkit. X provides basic capabilities for the creation and use of windows, and the toolkit contains a number of pre-packaged components, such as the ``sliders'' used for scrolling.
X also allows site, user, and per-run defaults to be set.

Adaptability is also an important goal, since we are providing a new function to research groups who are constantly adding new requirements. Stylistically, we have followed an ``Abstract Data Type'' discipline. In this discipline, a program is split into a number of modules, each of which provides a separate, well-defined function, and the interface of a module is kept separate from its implementation. For example, a unified internal sequence format is used. This format can store a varying amount of information, but the rest of the program accesses the module through a clear and simple interface. Such a style is not well supported by C, but its adoption has been very successful. The addition of new sequencing machines, and thus new external data formats, may cause some changes in the internal representation of the sequence but should not affect the rest of the program.

Ted accepts a large number of optional command-line arguments, many of which can also be specified as system defaults. This supports a mode of working in which ted is invoked not directly by the user but by a script or another application that supplies arguments appropriate to the editing task.

{\em Graphical Interface.} Ted currently accepts data from two fluorescence-based sequencing machines, the Pharmacia A.L.F. and the ABI 373A. The sequencing machine data consist of four traces of fluorescence levels together with the machine's interpretation, which is a sequence of bases. Ted displays the traces and the machine-generated base list. A second, initially identical, list of bases is provided for correction by the user.

Ted has an X-based graphical interface. The trace file can be specified either on the command line or by clicking on the INPUT button after the program has been invoked.
Other parameters that the user may specify on the command line include the output file name; a base position or sequence string on which the trace is to be centered; a default trace magnification; a 5' vector sequence for automated elimination of the sequence head (vector); top or bottom strand orientation; and any of the usual X parameters (e.g., display, geometry, \ldots).

The graphics display (Figure 1) consists of the control panel, the base position information, the original and edited sequence data, and the graphical representation of the trace. The user may begin by using the control panel INPUT button to input a new trace file, at which time the user selects whether to view the sequence and trace in top or bottom strand orientation. The trace file is displayed and, if a 5' vector sequence has been specified on the command line, the program attempts to select a cutoff point corresponding to the vector sequence at the ``head'' of the trace file. The bases beyond the ``cutoff'' point are displayed on a shaded background. The user may modify the cutoff position by clicking on the ``Adj left cut'' button and then clicking on the position of the desired cutoff. Similarly, the user may adjust the right cutoff of the sequence (chosen automatically by starting at the 5' end of the sequence and looking for the first point at which 2 out of 5 bases are `N') by scrolling along the sequence to that point, clicking on the ``Adj right cut'' button, and clicking on the appropriate base. Automation of the ``cutoff'' process is optional; the user may compile the program with that feature turned ``off.''

Clicking on the ``Edit seq'' button puts the program into edit mode. The ``Search'' button can be used to skip from ``problem'' to ``problem'' (i.e., ambiguity to ambiguity) or to look for runs of identical bases (e.g., TTTT), which are often miscalled by the machine software. Bases can be inserted, deleted, or replaced as in any ordinary word processor.
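
As an illustration of the automatic right-cutoff rule described above, the following is a minimal sketch in ANSI C; it is not the code used in ted. It assumes that ``2 out of 5 bases'' refers to a sliding window of five consecutive base calls, and that the cutoff is placed at the first base of the first such window. The function name {\tt right\_cutoff} and the constants {\tt WINDOW} and {\tt MIN\_N} are hypothetical.

\begin{verbatim}
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 5  /* bases examined together (assumed sliding window) */
#define MIN_N  2  /* number of 'N' calls that triggers the cutoff     */

/*
 * Scan from the 5' end (index 0) and return the index of the first
 * base of the first window of WINDOW consecutive bases that contains
 * at least MIN_N 'N' calls.  If there is no such window, return the
 * sequence length, meaning that nothing is cut off.
 */
static size_t right_cutoff(const char *bases)
{
    size_t len = strlen(bases);
    size_t i, j;

    for (i = 0; i + WINDOW <= len; i++) {
        int n_count = 0;

        for (j = i; j < i + WINDOW; j++) {
            if (bases[j] == 'N')
                n_count++;
        }
        if (n_count >= MIN_N)
            return i;
    }
    return len;
}

int main(void)
{
    /* Made-up base calls, for illustration only. */
    const char *seq = "ACGTACGTNACGTANCGNTT";
    size_t cut = right_cutoff(seq);

    assert(cut <= strlen(seq));  /* the cutoff always lies in range */
    printf("right cutoff at base index %lu of %lu\n",
           (unsigned long) cut, (unsigned long) strlen(seq));
    return 0;
}
\end{verbatim}

In this sketch an isolated `N' does not trigger the cutoff; only two ambiguous calls within five bases of one another do. A cutoff chosen this way would remain adjustable with the ``Adj right cut'' button.
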
In difficult-to-read areas, the trace may be scaled horizontally or vertically by dragging or clicking on the magnification scroll bar or by clicking on the vertical scaling buttons (``Scale down'', ``Scale up''), respectively. Finally, the edited sequence is saved to an ASCII file using the ``Output'' button. A history of the editing session can also be saved along with the sequence. The ``Quit'' button is used to exit the program.

When ted is reinvoked on an edited trace file, the edited base sequence, rather than the original sequence, is shown in the edited base window. The user may call up any one of the previous editing sessions when invoking ted.

\subsubsection*{APPLICATIONS AND CONCLUSIONS}
In the {\em C. elegans} genome sequencing project, data from the ABI or A.L.F. sequencing machines' computers are transferred to Sun workstations. The user invokes a Unix shell script that calls ted in turn on each file in the new set of trace files, creating a set of sequence files. The sequence files that are deemed to be of acceptable quality are then entered into the sequence assembly program xdap [2], where the sequences are assembled into contigs. Portions of the ted trace editor have been incorporated into the xdap ``trace manager,'' which is used in conjunction with the contig editor to view sets of aligned traces at sites of discrepancies in the aligned sequences.

Ted is also used when choosing oligo primers for the ``walking'' stage of the sequencing project. It can be invoked directly from the oligo selection program, osp [3], to allow examination of the trace data in the region of the primers so that the integrity of the sequence data can be verified.

To our knowledge, no other program currently available supports editing of ABI trace data. Further, the modular design of the program should allow support for new types of sequencing machines, with new data formats, to be implemented in a straightforward fashion.
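
The ``Abstract Data Type'' discipline described under Methods, on which this modularity rests, can be sketched in ANSI C with an opaque structure type. The following is only an illustration under our own assumptions, not ted's actual source: the type {\tt Seq} and all of the function names are hypothetical, and the representation (two parallel strings holding the original and edited bases) is chosen purely for brevity.

\begin{verbatim}
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Interface: in a real program this part would sit in a header.  */
/* Callers see only an incomplete type and the functions below.   */
typedef struct Seq Seq;

Seq   *seq_create(const char *bases);
void   seq_destroy(Seq *s);
size_t seq_length(const Seq *s);
char   seq_original_base(const Seq *s, size_t i);
char   seq_edited_base(const Seq *s, size_t i);
int    seq_replace_base(Seq *s, size_t i, char base);

/* Implementation: free to change without affecting callers.      */
struct Seq {
    size_t len;
    char  *orig;   /* machine-called bases    */
    char  *edit;   /* user-corrected copy     */
};

static char *copy_string(const char *src)
{
    char *dst = malloc(strlen(src) + 1);

    if (dst != NULL)
        strcpy(dst, src);
    return dst;
}

Seq *seq_create(const char *bases)
{
    Seq *s = malloc(sizeof *s);

    if (s == NULL)
        return NULL;
    s->len  = strlen(bases);
    s->orig = copy_string(bases);
    s->edit = copy_string(bases);
    if (s->orig == NULL || s->edit == NULL) {
        seq_destroy(s);
        return NULL;
    }
    return s;
}

void seq_destroy(Seq *s)
{
    if (s == NULL)
        return;
    free(s->orig);
    free(s->edit);
    free(s);
}

size_t seq_length(const Seq *s) { return s->len; }

char seq_original_base(const Seq *s, size_t i) { return s->orig[i]; }

char seq_edited_base(const Seq *s, size_t i) { return s->edit[i]; }

/* Replace base i in the edited copy; return 0, or -1 if out of range. */
int seq_replace_base(Seq *s, size_t i, char base)
{
    if (i >= s->len)
        return -1;
    s->edit[i] = base;
    return 0;
}

int main(void)
{
    Seq *s = seq_create("ACNT");   /* made-up base calls */

    if (s == NULL)
        return 1;
    assert(seq_replace_base(s, 2, 'G') == 0);
    printf("base 2: original %c, edited %c\n",
           seq_original_base(s, 2), seq_edited_base(s, 2));
    seq_destroy(s);
    return 0;
}
\end{verbatim}

Because the rest of such a program would use only the functions declared in the interface, supporting a new external data format would mean adding a reader that fills in a {\tt Seq}, while the display and editing code stays unchanged.
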
\subsubsection*{AVAILABILITY}
Ted is freely available from the authors or from Rodger Staden and Simon Dear (MRC Laboratory of Molecular Biology, Hills Road, Cambridge, UK, CB2 2QH) for use on Sun workstations running the X Window System (or OpenLook).

\subsubsection*{ACKNOWLEDGMENTS}
The authors would like to thank all members of the {\em C. elegans} sequencing project, with special thanks to the following people: John Sulston, Bob Waterston, Phil Green, Rick Wilson, Richard Durbin, Simon Dear, and Rodger Staden, for their helpful suggestions for improvements in the ted interface and for their parts in the development of ted. This work was supported by the Medical Research Council and NIH grant R01-HG00136.

\subsubsection*{REFERENCES}
1. Waterston, R., Sulston, J., et al. (1991), in preparation.

2. Dear, S. and Staden, R. (1991) Nucleic Acids Res., in press.

3. Hillier, L. and Green, P. (1991) submitted.

{\bf Figure 1 legend.} Figure 1 shows a ``screen dump'' of the ted graphical interface. The display consists of the control panel and the synchronized view of the base position information, original and edited sequence data, and graphical representation of the trace (with each nucleotide's trace represented by a different color). The control panel allows the user to read in new trace files (in either bottom or top strand orientation) as well as to search for a string of nucleotides or a certain base position. Scroll bars allow the user to adjust the magnification of, or scroll through, the sequence and trace data. The user may also choose to change the vertical magnification of the trace data. Further, sequence at the head (vector) or tail (uncertain data) of the sequence may be cut off using the adjust left and right cutoff buttons. Bases can be inserted, deleted, or replaced in the sequence data window as in any ordinary word processor. Finally, the sequence may be written to an ASCII file using the output button on the control panel.
\end{document}