Checking Xdap Databases For Errors
Using COP Version 1.1
Simon Dear
16 March 1992
0. Introduction
The program cop checks for editing errors in xdap project databases.
It uses a robust method that can detect insertions, deletions and
changes that have been made inadvertently. In later versions, places
where there is reliance on traces of insufficient quality will also
be detectable.
1. Usage
The program allows the user to specify the project name, the project
version, the consensus calculation cutoff percentage and a search path
for locating trace files:
        cop [-p project]
            [-v version]
            [-c consensus_cutoff_percentage]
            [-r raw_data_search_path]
            [-h]
An example: cop can be run on F59B2.??0 with the command:
        cop -p f59b2 -v 0 -r ~mmm/F59B2 -c 66
If the project and/or version are not specified, the user is prompted
for them. The default consensus cutoff percentage is 100%.
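The option handling described above can be sketched in Python. This
is only an illustration and not cop's own code: the function name
parse_cop_args, the attribute names and the prompt wording are
hypothetical, and -h is assumed to print a usage message (argparse
supplies that option itself).

```python
import argparse


def parse_cop_args(argv, ask=input):
    """Parse cop-style options; prompt for project/version if missing."""
    parser = argparse.ArgumentParser(prog="cop")
    parser.add_argument("-p", dest="project", help="project name")
    parser.add_argument("-v", dest="version", help="project version")
    # The guide gives 100% as the default consensus cutoff.
    parser.add_argument("-c", dest="cutoff", type=float, default=100.0,
                        help="consensus cutoff percentage")
    parser.add_argument("-r", dest="raw_path",
                        help="search path for trace files")
    args = parser.parse_args(argv)
    # Hypothetical prompt text; the guide only says the user is prompted.
    if args.project is None:
        args.project = ask("Project name: ")
    if args.version is None:
        args.version = ask("Project version: ")
    return args


if __name__ == "__main__":
    opts = parse_cop_args(
        ["-p", "f59b2", "-v", "0", "-r", "~mmm/F59B2", "-c", "66"])
    print(opts.project, opts.version, opts.cutoff, opts.raw_path)
```

Passing the prompt function in as a parameter keeps the sketch easy
to exercise without a terminal.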
If a trace file cannot be found in the current working directory and
the -r option is not used, the environment variable RAWDATA is used to
find the file.
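The lookup order just described might look like the following Python
sketch. It is a guess at the behaviour, not cop's implementation: the
name find_trace is hypothetical, and it assumes that both the -r
value and RAWDATA may hold several directories separated by the
platform's path separator.

```python
import os
from pathlib import Path


def find_trace(name, search_path=None, environ=None):
    """Return the first existing path for trace file `name`, or None."""
    env = os.environ if environ is None else environ
    # The current working directory is always tried first.
    candidates = [Path.cwd()]
    # Assumption: RAWDATA is consulted only when -r was not given.
    extra = search_path if search_path is not None else env.get("RAWDATA", "")
    candidates += [Path(d).expanduser() for d in extra.split(os.pathsep) if d]
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None
```

A call such as find_trace("some.trace", search_path="~mmm/F59B2")
would correspond to running cop with -r ~mmm/F59B2; the file name
here is made up for illustration.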
2. How cop works
Cop works on a problem exclusion principle. It ignores problem areas
(places where there are insertions, deletions, changes, or where the
trace quality is poor) and concentrates on identifying places where
the coverage is good. It then reports the regions where coverage is
poor. Unfortunately, this approach cannot explain why a reported
region is poorly covered.
The algorithm is as follows, and is performed on each contig.
a) The consensus for the contig is calculated and a "coverage"
array (to record areas of good coverage) is initialised.
b) Each gel reading in the contig is investigated. Information about
the trace file (its name and the sizes of its cutoffs) is read from
the database. The trace file is then read in.
c) The consensus of the region in which the gel reading lies is
aligned with the clipped trace sequence. If necessary, the consensus
is complemented. The alignment is performed using Myers and Miller's
algorithm [1], in the incarnation supplied in the fasta package.
d) A map is made relating the bases in the raw sequence to the bases
in the consensus. Places where trace quality is poor are removed from
this map. For each region in the consensus where there is perfect
alignment (no deletions, insertions or changes, and the bases are
still mapped) the coverage array is updated. Each entry in this array
represents a pair of adjacent bases, and both must be adjacent in the
alignment for the entry to be marked as covered.
e) Once all the readings in the contig have been processed, all gaps
in the coverage are reported.
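Steps d) and e) can be illustrated with a small Python sketch. It is
a simplified model under stated assumptions, not the cop source: the
alignment is taken as two equal-length strings with '-' marking gaps,
poor-quality read positions are supplied as a set, and the names
update_coverage and coverage_gaps are hypothetical.

```python
def update_coverage(coverage, aligned_cons, aligned_read,
                    cons_start=0, poor=frozenset()):
    """Mark consensus base pairs confirmed by one aligned reading.

    coverage[i] stands for the consensus pair (i, i + 1).
    poor holds 0-based read positions judged to be of poor quality.
    """
    cons_pos = cons_start   # consensus index of the current column
    read_pos = 0            # read index of the current column
    prev_ok = False         # was the previous column a clean match?
    for c, r in zip(aligned_cons, aligned_read):
        ok = (c != "-" and r != "-" and c == r and read_pos not in poor)
        # Both bases of the pair must match in adjacent columns.
        if ok and prev_ok:
            coverage[cons_pos - 1] = True
        prev_ok = ok
        if c != "-":
            cons_pos += 1
        if r != "-":
            read_pos += 1


def coverage_gaps(coverage):
    """Return inclusive (start, end) ranges of uncovered entries."""
    gaps, start = [], None
    for i, covered in enumerate(coverage):
        if not covered and start is None:
            start = i
        elif covered and start is not None:
            gaps.append((start, i - 1))
            start = None
    if start is not None:
        gaps.append((start, len(coverage) - 1))
    return gaps


if __name__ == "__main__":
    cov = [False] * 5                     # consensus ACGTAC has 5 pairs
    update_coverage(cov, "ACGTAC", "ACGAAC")   # one changed base
    print(coverage_gaps(cov))
```

Tracking pairs rather than single bases means that an insertion in
the reading, which falls between two consensus bases, still leaves an
uncovered entry and is therefore reported.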
A. References
[1] Myers, E.W. and Miller, W. 1988. Optimal alignments in linear
space. CABIOS 4(1):11-17.