gde_linux/CORE/xylem/xylem.doc

126 lines
6.4 KiB
Plaintext

XYLEM.DOC update 10 Aug 1994
XYLEM: TOOLS FOR MANIPULATION OF GENETIC DATABASES
Brian Fristensky, University of Manitoba
Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.
SPLITDB - Splits files containing one or more GenBank entries into
annotation, sequence, and index files. Indexfiles can also serve as
namefiles for GETLOC. Sequence files are in the format required for
use with the Pearson programs (FASTA,LFASTA etc.).
GETLOC - Reads a file containing LOCUS names (namefile) and
retrieves either annotation, sequence, or both from a split
database or database subset created by SPLITDB.
FETCH - A c-shell script that provides a convenient menu-driven
front end for retrieval of database entries using GETLOC.
FINDKEY - A c-shell script that provides a convenient menu-driven
front end for keyword searches of database annotation files,
using IDENTIFY.
IDENTIFY- Given line-numbered output from grep, IDENTIFY uses the
index file to determine which entries contained the keywords
searched for by grep. It then produces a namefile for use by
GETLOC. Namefiles can serve as logical databases, and utilities
such as the Unix comm command can perform logical operations on
these namefiles to produce database subsets.
FEATURES/GETOB - Given a namefile, pulls objects (mRNA, tRNA, CDS
etc.) from each of the named entries, using the new
DDBJ/EMBL/GenBank International Features Table Format. A future
version will also allow the annotation of sites within objects that
are extracted.
DBSTAT - Calculates amino acid frequencies in a protein database.
RIBOSOME - Given a file of one or more nucleic acids (eg. output
from GETOB) , RIBOSOME translates them into protein, using either
the universal genetic code or an alternative genetic code supplied
by the user. All ambiguities that can be resolved are translated.
PROT2NUC - reverse translates a sequence from protein to nucleic
acid, using IUPAC-IUB ambiguity codes.
SHUFFLE - Given a random seed, shuffles each sequence in a Pearson-
format (.wrp) file. Shuffling is done locally in overlapping windows
across the length of a given sequence. The window size and overlap
length can be specified by the user.
REFORM - Reformats multiply aligned nucleic acid or protein
sequences for publication. Output for M. Waterman's RALIGN
program, or the MBCRR MASE editor, can be directly used as input.
A variety of options are available for representing gaps, consensus
sequences and other features.
Fristensky (Cornell) Sequence Analysis Package - General purpose
sequence analysis package written in Standard Pascal. Features
include: sequence numbering, formatting, & translation, restriction
site searches & mapping, matrix similarity searches, TESTCODE
analysis, base composition analysis. All programs are interactive
and read free-format, BIONET, and GenBank files.
XYLEM DATABASE TOOLS
----------
| .gen | getloc
|----------|<--------------------------
| GenBank | |
---------- |
| |
| splitgb |
/|\ |
/ | \ |
/ | \ |
/ | \ |
/ | \ |
/ | \ |
v v v |
---------- ---------- ---------- |
| .ano | | .wrp | | .ind | |
|----------| |----------| |----------| |
|annotation| | sequence | | index | |
---------- ---------- ---------- |
| \ | / |
| \ | / |
| \ | / |
| \ | / |
grep -n | \ | / |
| \ | / |
| | |
| | -------------------------------+
| ^ |
v | getob |
---------- ---------- v
| .grep | identify | .nam | ----------
|----------| --------->|----------| | .wrp |
| numbered | | LOCUS | ----------
|file lines| ---------- | eg. mRNA |
---------- | ^ | tRNA |
| | | rRNA |
| | | CDS |
--comm-- ----------
(logical operations on
sets of names)
Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB R3T 2N2 CANADA
204-474-6085
frist@cc.umanitoba.ca