XYLEM.DOC                                         update 10 Aug 1994

         XYLEM: TOOLS FOR MANIPULATION OF GENETIC DATABASES
              Brian Fristensky, University of Manitoba

Fristensky, B. (1993) Feature expressions: creating and manipulating
sequence datasets. Nucleic Acids Research 21:5997-6003.

SPLITDB - Splits files containing one or more GenBank entries into
          annotation, sequence, and index files. Index files can
          also serve as namefiles for GETLOC. Sequence files are in
          the format required by the Pearson programs (FASTA,
          LFASTA, etc.).

GETLOC - Reads a file of LOCUS names (a namefile) and retrieves
          annotation, sequence, or both from a split database or
          from a database subset created by SPLITDB.

FETCH - A C shell script that provides a menu-driven front end for
          retrieving database entries with GETLOC.

FINDKEY - A C shell script that provides a menu-driven front end for
          keyword searches of database annotation files, using
          IDENTIFY.

IDENTIFY - Given line-numbered output from grep, IDENTIFY uses the
          index file to determine which entries contain the keywords
          that grep searched for. It then produces a namefile for
          use by GETLOC. Namefiles can serve as logical databases,
          and utilities such as the Unix comm command can perform
          logical operations on them to produce database subsets.

FEATURES/GETOB - Given a namefile, pulls objects (mRNA, tRNA, CDS,
          etc.) from each of the named entries, using the new
          DDBJ/EMBL/GenBank International Features Table Format.
          A future version will also allow annotation of sites
          within the extracted objects.

DBSTAT - Calculates amino acid frequencies in a protein database.

RIBOSOME - Given a file of one or more nucleic acid sequences (e.g.
          output from GETOB), RIBOSOME translates them into protein,
          using either the universal genetic code or an alternative
          genetic code supplied by the user. All ambiguities that
          can be resolved are translated.

PROT2NUC - Reverse-translates a sequence from protein to nucleic
          acid, using IUPAC-IUB ambiguity codes.


SHUFFLE - Given a random seed, shuffles each sequence in a
          Pearson-format (.wrp) file. Shuffling is done locally, in
          overlapping windows across the length of each sequence.
          The window size and overlap length can be specified by
          the user.

REFORM - Reformats multiply aligned nucleic acid or protein
          sequences for publication. Output from M. Waterman's
          RALIGN program or from the MBCRR MASE editor can be used
          directly as input. A variety of options are available for
          representing gaps, consensus sequences, and other
          features.

Fristensky (Cornell) Sequence Analysis Package - General-purpose
          sequence analysis package written in Standard Pascal.
          Features include sequence numbering, formatting, and
          translation; restriction site searches and mapping;
          matrix similarity searches; TESTCODE analysis; and base
          composition analysis. All programs are interactive and
          read free-format, BIONET, and GenBank files.


                  XYLEM DATABASE TOOLS

                  ----------
                 |   .gen   |         getloc
                 |----------|<--------------------------
                 | GenBank  |                           |
                  ----------                            |
                       |                                |
                       | splitgb                        |
                      /|\                               |
                    /  |  \                             |
                  /    |    \                           |
                /      |      \                         |
              /        |        \                       |
            /          |          \                     |
          v            v            v                   |
     ----------   ----------   ----------               |
    |   .ano   | |   .wrp   | |   .ind   |              |
    |----------| |----------| |----------|              |
    |annotation| | sequence | |  index   |              |
     ----------   ----------   ----------               |
         |  \          |          /                     |
         |    \        |        /                       |
         |      \      |      /                         |
         |        \    |    /                           |
grep -n  |          \  |  /                             |
         |           \ | /                              |
         |             |                                |
         |             | -------------------------------+
         |                      ^                       |
         v                      |         getob         |
     ----------             ----------                  v
    |  .grep   | identify  |   .nam   |             ----------
    |----------| --------->|----------|            |   .wrp   |
    | numbered |           |  LOCUS   |             ----------
    |file lines|            ----------             | eg. mRNA |
     ----------              |      ^              |     tRNA |
                             |      |              |     rRNA |
                             |      |              |     CDS  |
                             --comm--               ----------
              (logical operations on sets of names)


Dr. Brian Fristensky
Dept. of Plant Science
University of Manitoba
Winnipeg, MB R3T 2N2
CANADA
204-474-6085
frist@cc.umanitoba.ca
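

EXAMPLE: LOGICAL OPERATIONS ON NAMEFILES

The IDENTIFY entry notes that namefiles can act as logical databases
and that the Unix comm command can combine them. The following is a
minimal sketch of that idea, not part of XYLEM itself. It assumes a
namefile is plain text with one LOCUS name per line; the file names
(first.nam, second.nam, and so on) and the LOCUS names are
hypothetical and were chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: set operations on namefiles with standard Unix tools.
# Assumption: one LOCUS name per line. All names here are hypothetical.
LC_ALL=C; export LC_ALL     # keep sort and comm collation consistent
workdir=$(mktemp -d)

# Two made-up namefiles, e.g. the results of two keyword searches.
printf '%s\n' GAMMA3 ALPHA1 BETA2 > "$workdir/first.nam"
printf '%s\n' DELTA4 GAMMA3 BETA2 > "$workdir/second.nam"

# comm expects sorted input, so sort each namefile and drop duplicates.
sort -u "$workdir/first.nam"  > "$workdir/a.nam"
sort -u "$workdir/second.nam" > "$workdir/b.nam"

# AND: names present in both namefiles.
comm -12 "$workdir/a.nam" "$workdir/b.nam" > "$workdir/and.nam"

# NOT: names in the first namefile that are absent from the second.
comm -23 "$workdir/a.nam" "$workdir/b.nam" > "$workdir/not.nam"

# OR: names present in either namefile.
sort -u "$workdir/a.nam" "$workdir/b.nam" > "$workdir/or.nam"

cat "$workdir/and.nam"
```

Each resulting file is itself a namefile, so it could in principle be
given to GETLOC or GETOB in the same way as a namefile produced by
IDENTIFY. Sorting first matters because comm compares its two inputs
line by line and gives wrong answers on unsorted files.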