GETOB                                                      21 Dec 94

NAME
     getob - Get an object from GenBank

SYNOPSIS
     getob [-frcn] infile namefile anofile seqfile indfile message
           [outfile] expfile

DESCRIPTION
     getob extracts 'objects' (subsequences) from GenBank entries,
     using the features table, and writes them to outfile (.out). A
     log describing the construction of each object is written to
     message (.msg). If -r is not set, a list of expressions that
     would recreate the .out file if evaluated by getob -r is
     written to expfile (.exp).

     The following options are available:

     f    Write each entry to a separate file. The name will consist
          of the entry name and the extension '.obj'.

     r    Resolve expressions from namefile into objects.
          Expressions take the form:

               @[::]:

          In effect, r makes it possible to use getob to resolve
          features that span more than one entry, such as segmented
          files. In the first run of the program, features that
          require data from outside the entry in which they are
          defined will be written to outfile with those
          externally-defined parts represented using the '@'
          notation described above. During a subsequent run, the
          outfile from the previous run is used as namefile. When r
          is set, all lines not beginning with '@' (i.e. name lines
          and sequence lines) are simply copied to the new outfile.
          When an '@' is encountered, the expression is parsed into
          accession number and location. The entry with the
          specified accession number is located in indfile, and read
          from anofile and seqfile. It is then evaluated, and the
          result written to outfile in place of the '@' expression.

          getob can also be used to get specific labeled objects
          from a given entry. Examples:

               @k30576:polyprotein
               @k30576:/label=polyprotein
               @x10345:/product="hsp70"
               @j00879:group(1..2200,mutation_37)

          The first two constructs given above are equivalent. Both
          will extract the feature called polyprotein. The third
          construct shows that any feature label can be specified.
          If none is specified, as in the first example, then
          /label= is assumed.
          One limitation, however, is that the label sought must be
          unique within the entry in its first 15 characters,
          including double quotes ("). Otherwise, only the first
          matching label expression will be evaluated. Finally, the
          last example shows that a mutant sequence can be
          constructed by first specifying an expression that
          evaluates to a sequence (i.e. 1..2200) and then a labeled
          expression that, upon evaluation, uses replace() to modify
          that sequence. The usage shown in examples 3 & 4 above
          represents extensions to the DDBJ/EMBL/GenBank Features
          Table Format.

          As touched on briefly above, the r option makes it
          possible to construct objects that include recursive
          references to other entries (e.g. segmented files) by
          iterative calls to getob. The 'features' command automates
          this process. The basic algorithm is as follows:

               getob infile namefile anofile seqfile indfile ...

               #Pull out all lines containing indirect references
               grep '@' outfile > unresolved.grep

               while (unresolved.grep is not empty)

                  #extract accession numbers to be retrieved
                  cut -c2-7 unresolved.grep > unresolved.nam

                  #retrieve the sequences into a new file, and create
                  #a database subset to be used by getob
                  fetch unresolved.nam new.gen
                  splitdb new.gen new.ano new.wrp new.ind

                  #run getob again to resolve indirect references
                  getob -r infile outfile new.ano new.wrp new.ind ...

                  #Pull out all lines containing indirect references
                  grep '@' outfile > unresolved.grep
               end

     c    NAMEFILE contains accession numbers, rather than locus
          names.

     n    By default, the qualifier 'codon_start' is used to
          determine how many n's, if necessary, must be added to the
          5' end of CDS, mat_peptide, or sig_peptide, to preserve
          the reading frame. To turn OFF this feature, -n must be
          set. -n must be set for GenBank Releases 67.0 and earlier.

     infile contains commands indicating what data is to be pulled
     from each entry. Two types of output may be presented, GenBank
     or OBJECTS.
     These are described below:

     1) GenBank output - If the word 'GENBANK' is the first line in
        infile, a pseudo-GenBank entry will be recreated. This
        option is only intended for debugging purposes and will
        probably be removed in later releases.

     2) Object format - This option instructs getob to write part or
        all of each sequence, along with site annotation, by
        specifying feature key names. The syntax for infile is shown
        below:

        Backus-Naur format:          Example:
        ----------------------------------------------------------
        OBJECTS                      OBJECTS
                                     tRNA
        {                            rRNA
        . . .                        SITES
        }                            stem_loop
        SITES
        {
        . . .
        }

        In the example above, getob is instructed to extract all
        tRNA or rRNA sequences from each entry, and annotate the
        position of each stem/loop structure. Note that the SITES
        coordinates written to the file tell the positions of those
        SITES relative to the start of the object, rather than the
        original location in the sequence. As above, each word
        begins a separate line.

        While the -r option does not use infile, at least a dummy
        infile must be included in the command line. This dummy file
        need only contain two lines:

             OBJECTS
             SITES

        NOTE: SITES IS NOT YET IMPLEMENTED! Although inclusion of
        SITES in the input file will have no effect, the word SITES
        must still be present after the last feature key.

namefile
     namefile consists of a list of LOCUS names or accession
     numbers, each on a separate line. Names or accession numbers
     should appear in the order in which they appear in the database
     file. Unordered namefiles will slow the progress of the search.
     Since only the first non-blank field of each line in namefile
     is read, indfile could be used to create a namefile. If the
     entire indfile was used, the entire database file would be
     processed. A sample namefile requesting four sequences by LOCUS
     name is shown below:

          POTPR1A
          POTPSTH2
          POTPSTH21
          POTSTHA

anofile, seqfile, and indfile
     The database subset containing GenBank entries must be divided
     among annotation, sequence and an index by splitdb.
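The remark under namefile, that only the first non-blank field of each line is read, can be sketched in a few lines of Python. This is a minimal illustration and not part of getob: the function name first_fields and the sample index lines are hypothetical, and nothing is assumed about the layout of a real indfile beyond the leading name field.

```python
def first_fields(lines):
    """Return the first whitespace-delimited field of each non-blank line."""
    names = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            names.append(fields[0])
    return names


# Hypothetical index lines: only the leading name matters here; the
# remaining columns are made up for illustration.
sample_index = [
    "POTPR1A     J03679   1",
    "",
    "  POTPSTH2  X00000   57",
]

# One name per line, as a namefile expects.
print("\n".join(first_fields(sample_index)))
```

Writing the returned names one per line gives a file in the namefile layout described above, in the same order as the index lines they came from.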
message
     message contains a log describing the parsing of each object.
     For annotative purposes, qualifier lines from the object are
     included along with the location expression being parsed. The
     beginning of a typical message file is shown below:

     GETOB          Version 0.962       14 May 1992
     POTPR1A:CDS1
     join ( 295 603 1011 1355 )
     /note="pathogenesis-related protein (prp1)"
     /codon_start=1
     /translation="MAEVKLLGLRYSPFSHRVEWALKIKGVKYEFIEEDLQNKSPLLL
     QSNPIHKKIPVLIHNGKCICESMVILEYIDEAFEGPSILPKDPYDRALARFWAKYVED
     KGAAVWKSFFSKGEEQEKAKEEAYEMLKILDNEFKDKKCFVGDKFGFADIVANGAALY
     LGILEEVSGIVLATSEKFPNFCAWRDEYCTQNEEYFPSRDELLIRYRAYIQPVDASK"
     //----------------------------------------------

     In the example above, getob was instructed to retrieve all CDS
     features from the database subset. The message for the entry
     POTPR1A is shown, along with a reconstruction of the location
     expression that was evaluated to create the object. In this
     case, protein coding sequences from two exons had to be joined
     to create the object.

outfile
     outfile contains the actual objects constructed, consisting of
     sites found and sequences. The beginning of a typical output
     file is shown below:

     >POTPR1A:CDS1
     atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
     agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
     aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
     aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
     ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
     ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
     gaagataag
     ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
     agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
     aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
     gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
     tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
     aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
     atccgttaccgagcctacattcagcctgttgatgcttcaaaatga

     In the example, the CDS from entry POTPR1A has been written in
     two chunks, corresponding to the two exon portions of the
     coding sequence.
     Each location retrieved in constructing the object is written
     as a separate block of sequence. By comparing the message file
     to outfile, it is possible to verify the correctness of the
     operation. Numbers are appended to the sequence names to
     indicate which CDS in the entry has been retrieved. Thus, if
     two CDS features were present, the second one would be named
     >POTPR1A:CDS2. For compatibility with the FASTA programs of
     Pearson, the name line begins with a '>'.

expfile
     The expression evaluated to create this feature is written to
     expfile:

     >POTPR1A:CDS1
     @J03679:join(295..603,1011..1355)

     expfile is only created if -r is not set. It is intended as a
     way of automating the creation of a feature expression file for
     use in generating customized datasets. Expressions in expfile
     can be deleted or modified, or new expressions added, to tailor
     the dataset to individual needs. To generate a dataset from
     expfile:

          getob -r infile expfile anofile seqfile indfile message outfile

EXTENSIONS TO THE FEATURE TABLE LANGUAGE
     1) poly(||,x)

        This operator evaluates an absolute location, literal, or
        feature name (i.e. any location not containing functional
        operators) and writes it x times. The most obvious
        application of poly is to create spacers to represent
        regions of unknown sequence between sequences that are
        known. For example, the restriction map of a 4kb EcoR1
        fragment with a Hind3 site 1000 bp from one end could be
        represented as follows:

        join("gaattc",poly("n",1000),"aagctt",poly("n",3000),"gaattc")

     2) The following feature keys are recognized by GETOB, although
        not included in the language definition. While they will not
        appear in GenBank entries, they could be used in
        user-created GenBank-format files:

        contig       This feature key is meant to be used to
                     assemble large sequence segments from smaller
                     segments, possibly using the poly() operator.

        chromosome   Intended to annotate the complete sequence of a
                     chromosome. This feature may be constructed by
                     a join of two or more contigs.
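For the literal-only case, the behaviour of poly() inside join() can be mimicked in Python. This is an illustrative sketch and not getob's implementation: the Python functions poly and join below are stand-ins that accept only string literals, whereas the real operator also accepts absolute locations and feature names.

```python
def poly(unit, times):
    """Repeat a literal sequence `times` times (literal operands only)."""
    return unit * times


def join(*parts):
    """Concatenate already-evaluated sequence parts in order."""
    return "".join(parts)


# The restriction-map expression from item 1, written with the stand-ins.
fragment = join("gaattc", poly("n", 1000), "aagctt",
                poly("n", 3000), "gaattc")

print(len(fragment))        # total length of the modelled fragment
print(fragment[1000:1018])  # the Hind3 site with flanking spacer
```

The modelled fragment is 4018 bases long: three 6-base recognition sites plus 4000 n's of spacer. The Hind3 site "aagctt" starts at 0-based offset 1006, immediately after the first EcoR1 site and its 1000-base spacer.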
        Use of the contig and chromosome keywords is illustrated in
        the features table shown below, which could be used to
        construct a model of part of the E. coli chromosome,
        spanning map units 763.4 to 1031.4 kb:

        contig          join(J01619:1..13063,poly("n",7140),
                        J03939:1..1363,poly("n",14380),
                        X02306:complement(1..1622),poly("n",14710),
                        J04423:1..5793,poly("n",22500),
                        X03722:1..2400,poly("n",123750),
                        one-of(X05017:complement(1..1854),X05017:1..1854))
                        /label=Eco_contig8
                        /map=763.4-950.6kb
        contig          join(V00352:1..2412,poly("n",28800),M15273:1..3409)
                        /label=Eco_contig9
                        /map=972.9-1001.7kb
        contig          join(X02826:1..1357,poly("n",13540),
                        J01654:complement(1..2270))
                        /label=Eco_contig10
                        /map=1016.5-1031.4kb
        chromosome      join(Eco_contig8,poly("n",22300),
                        Eco_contig9,poly("n",14800),
                        Eco_contig10)
                        /label=Ecoli_chromosome

NOTES
     1) If the const DEBUG is set to true in the Pascal source code,
        getob writes messages to the standard output, indicating the
        progress of processing for each entry read in. By default,
        DEBUG=false. This feature is solely for debugging purposes
        and will be removed in later releases.

     2) GETOB automatically expands leading blanks that have been
        compressed using splitdb -c. See splitdb.doc for more
        information.

SEE ALSO
     features, splitdb, getloc

     The DDBJ/EMBL/GenBank Feature Table: Definition, Version 1.04,
     September 1, 1992

     GenBank Release Notes for Release 79.0.

AUTHOR
     Dr. Brian Fristensky
     Dept. of Plant Science
     University of Manitoba
     Winnipeg, MB Canada R3T 2N2
     Phone: 204-474-6085
     FAX: 204-261-5732
     frist@cc.umanitoba.ca

REFERENCE
     Fristensky, B. (1993) Feature expressions: creating and
     manipulating sequence datasets. Nucleic Acids Research
     21:5997-6003.