229 lines
9.1 KiB
Text
229 lines
9.1 KiB
Text
|
|
* ReadSeq.Help -- 30 Dec 92
|
|
*
|
|
* Reads and writes nucleic/protein sequences in various
|
|
* formats. Data files may have multiple sequences.
|
|
*
|
|
* Copyright 1990 by d.g.gilbert
|
|
* biology dept., indiana university, bloomington, in 47405
|
|
* e-mail: gilbertd@bio.indiana.edu
|
|
*
|
|
* This program may be freely copied and used by anyone.
|
|
* Developers are encourged to incorporate parts in their
|
|
* programs, rather than devise their own private sequence
|
|
* format.
|
|
*
|
|
* This should compile and run with any ANSI C compiler.
|
|
* Please advise me of any bugs, additions or corrections.
|
|
|
|
Readseq is particularly useful as it automatically detects many
|
|
sequence formats, and interconverts among them.
|
|
|
|
Formats which readseq currently understands:
|
|
|
|
* IG/Stanford, used by Intelligenetics and others
|
|
* GenBank/GB, genbank flatfile format
|
|
* NBRF format
|
|
* EMBL, EMBL flatfile format
|
|
* GCG, single sequence format of GCG software
|
|
* DNAStrider, for common Mac program
|
|
* Fitch format, limited use
|
|
* Pearson/Fasta, a common format used by Fasta programs and others
|
|
* Zuker format, limited use. Input only.
|
|
* Olsen, format printed by Olsen VMS sequence editor. Input only.
|
|
* Phylip3.2, sequential format for Phylip programs
|
|
* Phylip, interleaved format for Phylip programs (v3.3, v3.4)
|
|
* Plain/Raw, sequence data only (no name, document, numbering)
|
|
+ MSF multi sequence format used by GCG software
|
|
+ PAUP's multiple sequence (NEXUS) format
|
|
+ PIR/CODATA format used by PIR
|
|
+ ASN.1 format used by NCBI
|
|
+ Pretty print with various options for nice looking output. Output only.
|
|
|
|
See the included "Formats" file for detail on file formats.
|
|
|
|
|
|
Example usage:
|
|
readseq
|
|
-- for interactive use
|
|
|
|
readseq my.1st.seq my.2nd.seq -all -format=genbank -output=my.gb
|
|
-- convert all of two input files to one genbank format output file
|
|
|
|
readseq my.seq -all -form=pretty -nameleft=3 -numleft -numright -numtop -match
|
|
-- output to standard output a file in a pretty format
|
|
|
|
readseq my.seq -item=9,8,3,2 -degap -CASE -rev -f=msf -out=my.rev
|
|
-- select 4 items from input, degap, reverse, and uppercase them
|
|
|
|
cat *.seq | readseq -pipe -all -format=asn > bunch-of.asn
|
|
-- pipe a bunch of data thru readseq, converting all to asn
|
|
|
|
|
|
The brief usage of readseq is as follows. The "[]" denote
|
|
optional parts of the syntax:
|
|
|
|
readseq -help
|
|
readSeq (27Dec92), multi-format molbio sequence reader.
|
|
usage: readseq [-options] in.seq > out.seq
|
|
options
|
|
-a[ll] select All sequences
|
|
-c[aselower] change to lower case
|
|
-C[ASEUPPER] change to UPPER CASE
|
|
-degap[=-] remove gap symbols
|
|
-i[tem=2,3,4] select Item number(s) from several
|
|
-l[ist] List sequences only
|
|
-o[utput=]out.seq redirect Output
|
|
-p[ipe] Pipe (command line, <stdin, >stdout)
|
|
-r[everse] change to Reverse-complement
|
|
-v[erbose] Verbose progress
|
|
-f[ormat=]# Format number for output, or
|
|
-f[ormat=]Name Format name for output:
|
|
1. IG/Stanford 10. Olsen (in-only)
|
|
2. GenBank/GB 11. Phylip3.2
|
|
3. NBRF 12. Phylip
|
|
4. EMBL 13. Plain/Raw
|
|
5. GCG 14. PIR/CODATA
|
|
6. DNAStrider 15. MSF
|
|
7. Fitch 16. ASN.1
|
|
8. Pearson/Fasta 17. PAUP
|
|
9. Zuker 18. Pretty (out-only)
|
|
|
|
Pretty format options:
|
|
-wid[th]=# sequence line width
|
|
-tab=# left indent
|
|
-col[space]=# column space within sequence line on output
|
|
-gap[count] count gap chars in sequence numbers
|
|
-nameleft, -nameright[=#] name on left/right side [=max width]
|
|
-nametop name at top/bottom
|
|
-numleft, -numright seq index on left/right side
|
|
-numtop, -numbot index on top/bottom
|
|
-match[=.] use match base for 2..n species
|
|
-inter[line=#] blank line(s) between sequence blocks
|
|
|
|
|
|
Notes:
|
|
|
|
In use, readseq will respond to command line arguments, or to
|
|
interactive use. Command line arguments cannot be combined
|
|
but must each follow a switch character (-). In this release,
|
|
the command line options are now words, with an equals (=)
|
|
to separate parameter(s) fromt he command. You cannot put a
|
|
space between a command and its parameter, as is usual for
|
|
Unix programs (this is to preserve compatibility with VMS).
|
|
The command line syntax of the earlier versions is still
|
|
supported.
|
|
|
|
See the file Formats for details of the sequence formats which
|
|
are supported by readseq. The auto-detection feature of
|
|
readseq which distinguishes these formats looks for some of the
|
|
unique keywords and symbols that are found in each format. It
|
|
is not infallible at this, though it attempts to exclude unknown
|
|
formats. In general, if you feed to readseq a sequence file that
|
|
you know is one of these common formats, you are okay. If you feed
|
|
it data that might be oddball formats, or non-sequence data,
|
|
you might well get garbage results. Also, different developers
|
|
are always thinking up minor twists on these common formats
|
|
(like PAUP requiring a blank line between blocks of Phylip format,
|
|
or IG adding form feeds between sequences), which may cause hassles.
|
|
|
|
In general, output supports only minimal subsets of each format
|
|
needed for sequence data exchanges. Features, descriptions
|
|
and other format-unique information is discarded.
|
|
|
|
The pretty format requires additional options to generate a
|
|
nice output. Try the various pretty options to see what you like.
|
|
Pretty format is OUPUT only, readseq cannot read a Pretty format
|
|
file.
|
|
|
|
Readseq is NOT optimized for LARGE files. It generally makes several
|
|
reads thru each input file (one per sequence output at present, future
|
|
version may optimize this). It should handle input and output files
|
|
and sequences of any size, but will slow down quite a bit for very large
|
|
(multi megabyte) sized files. It is NOT recommended for converting
|
|
databanks or large subsets there-of. It is primarily directed at the
|
|
small files that researchers use to maintain their personal data, which
|
|
they frequently need to interconvert for the various analysis programs
|
|
which so frequently require a special format.
|
|
|
|
Users of Olsen multi sequence editor (VMS). The Olsen format
|
|
here is produced with the print command:
|
|
print/out=some.file
|
|
Use Genbank output from readseq to produce a format that this
|
|
editor can read, and use the command
|
|
load/genbank some.file
|
|
Dan Davison has a VMS program that will convert to/from the
|
|
Olsen native binary data format. E-mail davison@uh.edu
|
|
|
|
Warning: Phylip format input is now supported (30Dec92), however the
|
|
auto-detection of Phylip format is very probabilistic and messy,
|
|
especially distinguishing sequential from interleaved versions. It
|
|
is not recommended that one use readseq to convert files from Phylip
|
|
format to others unless essential.
|
|
|
|
|
|
This program is available thru Internet gopher, as
|
|
|
|
gopher ftp.bio.indiana.edu
|
|
browse into the IUBio-Software+Data/molbio/readseq/ folder
|
|
select the readseq.shar document
|
|
|
|
Or thru anonymous FTP in this manner:
|
|
my_computer> ftp ftp.bio.indiana.edu (or IP address 129.79.224.25)
|
|
username: anonymous
|
|
password: my_username@my_computer
|
|
ftp> cd molbio/readseq
|
|
ftp> get readseq.shar
|
|
ftp> bye
|
|
|
|
readseq.shar is a Unix shell archive of the readseq files.
|
|
This file can be editted by any text editor to reconstitute the
|
|
original files, for those who do not have a Unix system or an
|
|
Unshar program. Read the top of this .shar file for further
|
|
instructions.
|
|
|
|
There are also pre-compiled executables for the following computers:
|
|
Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax,
|
|
Macintosh. Use binary ftp to transfer these, except Macintosh. The
|
|
Mac version is just the command-line program in a window, not very
|
|
handy.
|
|
|
|
C source files:
|
|
readseq.c ureadseq.c ureadasn.c ureadseq.h
|
|
|
|
Document files:
|
|
Readme (this doc)
|
|
Formats (description of sequence file formats)
|
|
add.gdemenu (GDE program users can add this to the .GDEmenu file)
|
|
Stdfiles -- test sequence files
|
|
Makefile -- Unix make file
|
|
Make.com -- VMS make file
|
|
*.std -- files for testing validity of readseq
|
|
|
|
|
|
Recent changes (see also readseq.c for all history of changes):
|
|
|
|
4 May 92
|
|
+ added 32 bit CRC checksum as alternative to GCG 6.5bit checksum
|
|
Aug 92
|
|
= fixed Olsen format input to handle files w/ more sequences,
|
|
not to mess up when more than one seq has same identifier,
|
|
and to convert number masks to symbols.
|
|
= IG format fix to understand ^L
|
|
30 Dec 92
|
|
* revised command-line & interactive interface. Suggested form is now
|
|
readseq infile -format=genbank -output=outfile -item=1,3,4 ...
|
|
but remains compatible with prior commandlines:
|
|
readseq infile -f2 -ooutfile -i3 ...
|
|
+ added GCG MSF multi sequence file format
|
|
+ added PIR/CODATA format
|
|
+ added NCBI ASN.1 sequence file format
|
|
+ added Pretty, multi sequence pretty output (only)
|
|
+ added PAUP multi seq format
|
|
+ added degap option
|
|
+ added Gary Williams (GWW, G.Williams@CRC.AC.UK) reverse-complement option.
|
|
+ added support for reading Phylip formats (interleave & sequential)
|
|
* string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, NEEDSTRCASECMP
|
|
* changed 32bit checksum to default, -DSMALLCHECKSUM for GCG version
|
|
|
|
|