Compare commits

10 Commits
1.02 ... main

Author SHA1 Message Date
kuoi 4037b84781 fix: python3 migration bug 2024-01-15 13:17:50 +08:00
kuoi d21c1ca08d Update README.md 2022-03-05 14:47:34 +00:00
kuoi bc33912068 Delete sample_data directory 2022-03-05 14:42:46 +00:00
kuoi 74ad0ea3eb Delete README 2022-03-05 14:42:35 +00:00
kuoi 049b611311 Delete TIGER_Manual.pdf 2022-03-05 14:42:17 +00:00
kuoi 0201b8df0f Delete tiger_fns_102.pyc 2022-03-05 14:42:09 +00:00
kuoi 05484d37f8 update the README about the requirements 2022-02-09 00:25:38 +00:00
kuoi 5aa4ff3e6b fix the python2to3 bugs 2022-02-09 00:19:14 +00:00
kuoi b847b1f634 update to support python3 2022-02-05 20:34:05 +00:00
kuoi 8a3c5bab7e add source 2022-01-30 01:07:09 +00:00
3 changed files with 674 additions and 5 deletions


@ -1,15 +1,75 @@
# tiger
Identifying rapidly-evolving characters in evolutionary data
## About Tiger
TIGER is open source software for identifying rapidly evolving sites (columns in an alignment, or characters in a morphological dataset). It can deal with many kinds of data (molecular, morphological etc.). Sites like these are important to identify as they are very often removed or reweighted in order to improve phylogenetic reconstruction. When a site is changing very quickly between taxa it might not hold much phylogenetic information and therefore might simply be a source of noise. Use of TIGER can (a) allow you to see the amount of rapid evolution and noise in your alignment and (b) provide a quick and easy way to remove as many of the “noisy” sites as possible.
TIGER uses conflict between site patterns as a proxy for rapid evolution; that is, a site that does not conflict with other sites in the alignment is generally a very slowly evolving or constant site. A site with lots of conflict is considered rapidly evolving (Cummins & McInerney, Systematic Biology, 2011). TIGER rates the conflict and categorizes the sites based on the rates. In this software the categories are called bins and are user definable (see manual for further details). Bin1 will contain the constant sites and the bin with the highest number will contain the most rapidly evolving sites.
## Usage
```
****************
TIGER Help:
****************
TIGER: Tree-Independent Generation of Evolutionary Rates
(Developed by Carla Cummins in the lab of James McInerney, NUI Maynooth, Co. Kildare, Ireland)
-Options:
    -in     Specify input file. File must be in FastA format and must be aligned prior.
            Datasets with uneven sequence lengths will return an error.
    -v      Returns current TIGER version.
    -f      Changes output formatting options.
            -f s: sorts sites depending on their agreement score
            -f r: displays rank values rather than bin numbers
            -f s,r: displays sorted ranks (*Be sure to put only a "," NO SPACE!)
            Default prints bin numbers unsorted.
    -b      Set the number of bins to be used.
            -b <int>: Sites will be placed into <int> number of bins. <int> is a whole number.
            Default is 10
    -rl     A list of the rate at each site may be optionally written to a specified
            file.
            -rl <file.txt> : writes list of the rates at each site to file.txt.
    -ptp    Specifies that a PTP test should be run. *Note: this option has a huge
            effect on running time!
    -z      Number of randomisations to be used for the PTP test.
            -z <int>: each site will be randomised <int> times. <int> is a whole number.
            Default is 100
    -p      Specify p-value which denotes significance in PTP test.
            -p <float>: site will be denoted as significant if p-value is better than <float>.
            <float> is a floating point number.
            Default is 0.05
    -pl     Write a list of p-values to a specified file.
            -pl <file.txt>: writes list of p-values for each site to file.txt.
    -u      Specify unknown characters in the alignment. Unknown characters are omitted from
            site patterns and so are not considered in the analysis.
            -u ?,-,*: defines ?, - and * as unknown characters. (*Be sure to put only a comma
            between characters, NO SPACE!!)
            Default is ? only
```
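The two ideas described above, site patterns and bins, can be illustrated with a small sketch. This is not TIGER's own code: `site_pattern` and `assign_bins` are hypothetical helper names, and the binning assumes equal-width intervals between the lowest and highest agreement rate, with bin 1 holding the highest rates (constant or slowly evolving sites).

```python
import math


def site_pattern(site, unknown=("?",)):
    """Group taxon indices by shared character state, e.g. 'AAGG' -> '0,1|2,3'."""
    groups = {}  # state -> taxon indices, kept in order of first appearance
    for index, state in enumerate(site):
        if state in unknown:
            continue  # unknown characters are left out of the pattern
        groups.setdefault(state, []).append(str(index))
    return "|".join(",".join(members) for members in groups.values())


def assign_bins(rates, n_bins=10):
    """Place each rate into one of n_bins equal-width bins; bin 1 = highest rate."""
    highest, lowest = max(rates), min(rates)
    if highest == lowest:
        return [1] * len(rates)  # all sites agree equally, so one bin suffices
    width = (highest - lowest) / n_bins
    bins = []
    for rate in rates:
        # Number of bin widths below the maximum, rounded up, clamped to 1..n_bins.
        steps = math.ceil((highest - rate) / width)
        bins.append(min(max(steps, 1), n_bins))
    return bins


if __name__ == "__main__":
    column = "AAGG?"  # one alignment column across five taxa
    print(site_pattern(column))  # taxa 0 and 1 share A, taxa 2 and 3 share G
    print(assign_bins([1.0, 0.75, 0.5, 0.0], n_bins=4))
```

A pattern without a `|` separator describes a constant site, which is why such sites end up with the highest agreement rate and land in bin 1.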
## System Requirements
TIGER is implemented in Python, so it should run on most computer platforms.
- Python 3.x.
For UNIX machines, a working version of Python 2.5 or 2.6 is required. On Windows machines, TIGER comes with everything it needs and requires no further installations.
## Note
Version 1.04 was made by Guoyi Zhang, not by the original authors, so please cite this repository.
## Citation
- Cummins, C.A. and McInerney, J.O. (2011) A method for inferring the rate of evolution of homologous characters that can potentially improve phylogenetic inference, resolve deep divergence and correct systematic biases. Systematic Biology 60 (6) 833-844. doi: 10.1093/sysbio/syr064.
- Zhang G. (2022) TIGER version 1.04. https://github.com/starsareintherose/tiger

380
tiger Normal file

@ -0,0 +1,380 @@
#!/usr/bin/env python
#*********TIGER v1.02*************
#...read in options...
import time
lt_S = time.localtime()
import sys
import random
import re
from tiger_fns_102 import *
options = sys.argv
formRate = 0
formSort = 0
binNo = 10
file = ""
rate_file = ""
write_rates = False
numbered = False
doPTP = False
rands = 100
pval = 0.05
write_pvals = False
pval_file = ""
unknown = ["?"]
if len(options) < 2:
    printHelp()
    sys.exit(0)
for opt in range(len(options)):
    if options[opt] == "-in":
        file_name = options[opt + 1]
        try:
            file = open(file_name)
        except IOError:
            print ("File \"" + file_name + "\" not found...")
            sys.exit(0)
    elif options[opt] == "-b":
        binNo = int(options[opt + 1])
    elif options[opt] == "-f":
        formOpts = options[opt + 1].split(",")
        if "r" in formOpts:
            formRate = 1
        if "s" in formOpts:
            formSort = 1
        if "c" in formOpts:
            numbered = True
    elif options[opt] == "-v":
        print ("TIGER version 1.02")
        sys.exit(0)
    elif options[opt] == "-rl":
        write_rates = True
        try:
            rate_file = options[opt+1]
            if rate_file[0] == "-":
                print ("\n\nPlease specify file name for -rl option.\n\n")
                sys.exit(0)
        except IndexError:
            print ("\n\nPlease specify file name for -rl option.\n\n")
            sys.exit(0)
    elif options[opt] == "-pl":
        write_pvals = True
        try:
            pval_file = options[opt+1]
            if pval_file[0] == "-":
                print ("\n\nPlease specify file name for -pl option.\n\n")
                sys.exit(0)
        except IndexError:
            print ("\n\nPlease specify file name for -pl option.\n\n")
            sys.exit(0)
    elif options[opt] == "-ptp":
        doPTP = True
    elif options[opt] == "-z":
        rands = int(options[opt+1])
    elif options[opt] == "-p":
        pval = float(options[opt+1])
    elif options[opt] == "-u":
        unknown = options[opt+1].split(",")
if not file:
    print ("No file specified (-in option)")
    sys.exit(0)
#Parse .aln file
names = []
seqs = []
if ">" in file.readline():
    tmp = FastaParse(file_name)
else:
    print ("""
*******************************
File not in correct format!
TIGER accepts FastA format.
*******************************
""")
    sys.exit(0)
# re is already imported at the top of the script, so the second import is dropped
names = tmp[0]
for x, nm in (enumerate(names)):
    names[x] = re.sub(" ", "", nm)
seqs = tmp[1]
lns = [len(l) for l in seqs]
lns.sort()
if lns[0] != lns[-1]:
    print ("\n\nUneven sequence lengths. Ensure sequences have been aligned!\n\n")
    sys.exit(0)
datatype = DNAdetect(seqs[0])
#Create array of site patterns
patterns = []
sites = []
comp_Pats = []
comp_Sites = []
for i in range(len(seqs[0])):
    sites.append("".join([j[i] for j in seqs]))
    patterns.append(getPattern(sites[-1], unknown))
    if patterns[-1] not in comp_Pats:
        comp_Pats.append(patterns[-1])
        comp_Sites.append(sites[-1])
#Compare character patterns and score
ranks = list(range(len(patterns))) #ranks = range(len(patterns))
comp_ranks = []
keep = []
br = ""
if doPTP:
    comp_disagree = []
    disagree = list(range(len(patterns))) #disagree = range(len(patterns))
for aa, patA in enumerate(comp_Pats):
    # raw string avoids the invalid "\|" escape warning in Python 3
    if re.search(r"\|", patA):
        cfl = scoreConflict(aa, patA, patterns)
        comp_ranks.append(1.0 - cfl)
    else:
        comp_ranks.append(1.0)
    if doPTP:
        dis = 0
        for z in range(rands):
            patJ = jumblePattern(comp_Sites[aa], unknown)
            scr = scoreConflict(aa, patJ, patterns)
            #check if site has higher conflict score than original site
            if scr >= (1.0 - comp_ranks[aa]):
                dis += 1
        dis = float(dis)/float(rands)
        comp_disagree.append(dis)
    #copy the score to every site that shares this pattern
    i = -1
    try:
        while 1:
            i = patterns.index(patA, i + 1)
            ranks[i] = comp_ranks[aa]
            if doPTP:
                disagree[i] = comp_disagree[aa]
    except ValueError:
        pass
#Output total alignment score to file alnRate.txt
if write_rates:
    handle = open(rate_file, 'w')
    handle.write("\n".join([str(r) for r in ranks]))
    handle.close()
if write_pvals and not doPTP:
    # outF is not open on this path, so open the requested file before writing the notice
    outF = open(pval_file, 'w')
    outF.write("-ptp option not selected. No p-values to write!")
    outF.close()
#Find significant sites
if doPTP:
    sig_dis = []
    for x in range(len(disagree)):
        if disagree[x] < pval:
            sig_dis.append(x+1)
    if write_pvals:
        outF = open(pval_file, 'w')
        for d in range(len(disagree)):
            outF.write(str(disagree[d]))
            if (d+1) in sig_dis:
                outF.write(" *")
            outF.write("\n")
        outF.close()
#BINNING
maxRank = max(ranks)
minRank = min(ranks)
binStr = []
for numb, r in enumerate(ranks):
    if 1 > 0:
        #bin boundaries, from the highest rank down to the lowest
        binParts = []
        for mult in range(binNo + 1):
            binParts.append(minRank + (((maxRank - minRank)/binNo)*mult))
        binParts.reverse()
        for bin in range(1, len(binParts)):
            if r < binParts[bin - 1] and r >= binParts[bin]:
                binStr.append(str(bin))
                break
            elif round(r, 5) == round(maxRank, 5):
                binStr.append("1")
                break
#Make names equal lengths
filler = " "*20
filled_names = names[:]
for n, nm in enumerate(names):
    nm_r = re.sub(r"[\s]+", "_", nm)
    if len(nm_r) > 20:
        filled_names[n] = nm_r[1] + "_" + nm_r[-18:]
    else:
        filled_names[n] = nm_r+ filler[:(20 -len(nm_r))]
#print in correct format.... (
print ("#NEXUS\n\n[This file contains data that has been analysed for site specific rates]")
print ("[using TIGER, developed by Carla Cummins in the laboratory of]")
print ("[Dr James McInerney, National University of Ireland, Maynooth]\n\n")
print ("[Histograms of number of sites in each category:]")
Hnames = []
counts = []
for b in range(binNo):
    Hnames.append("Bin" + str(b+1) + "\t")
    counts.append(binStr.count(str(b+1)))
histogram(counts, Hnames)
print ("\n\n")
print ("\n\n\nBEGIN TAXA;")
# In Python 3 every value must be inside the print() call; the 2to3 leftovers
# printed only the first string and discarded the rest.
print ("\tDimensions NTax = ", len(seqs), ";")
print ("\tTaxLabels ", " ".join(filled_names), ";\nEND;\n")
print ("BEGIN CHARACTERS;")
print ("\tDimensions nchar = ", len(seqs[0]), ";")
print ("\tFormat datatype = ", datatype, " gap = - interleave;\nMatrix\n")
sorted = list(range(len(ranks))) #sorted = range(len(seqs[0]))
if formSort == 1:
    # must be a list: a Python 3 range object does not support item assignment
    sorted = list(range(len(ranks)))
    sr = ranks[:]
    sr.sort()
    sr.reverse()
    s = 0
    sortD = {}
    for x in range(len(ranks)):
        ind = sr.index(ranks[x])
        sortD[x] = ind
        sorted[ind] = x
        sr[ind] = "|"
    print (sortD)
    if doPTP:
        sig_sorted = []
        for d in sig_dis:
            sig_sorted.append(sortD[d-1]+1)
        sig_dis = sig_sorted
#write the matrix in interleaved blocks of 60 sites
for xy in range(0, len(seqs[0]), 60):
    for xz in range(len(seqs)):
        ln = filled_names[xz] + "\t"
        if len(seqs[xz][xy:]) < 60:
            for i in sorted[xy:]:
                ln = ln + seqs[xz][i].upper()
        else:
            for j in sorted[xy:xy+60]:
                ln = ln + seqs[xz][j].upper()
        print (ln)
    if formRate == 0:
        for x in range(len(str(binNo))):
            if x == 0:
                bnls = "[Bin Numbers \t"
            else:
                bnls = "[" + filler + "\t"
            for y in range(xy, (xy + 60)):
                if y < len(binStr):
                    if len(binStr[sorted[y]]) - 1 < x:
                        bnls = bnls + " "
                    else:
                        s = sorted[y]
                        bnls = bnls + str(binStr[s])[x]
                else:
                    break
            print (bnls + "]")
    else:
        for c in range(5):
            if c == 0:
                rtls = "[Rank Values \t"
            else:
                rtls = "[" + filler + "\t"
            for d in range(xy, (xy + 60)):
                if d < len(ranks):
                    if len(str(ranks[sorted[d]])) - 1 < c:
                        rtls = rtls + " "
                    else:
                        rtls = rtls + str(ranks[sorted[d]])[c]
                else:
                    break
            print (rtls + "]")
    if numbered:
        digits = len(str(len(sorted)))
        srted = sorted[:]
        for sn, sv in enumerate(srted):
            if len(str(sv)) < digits:
                srted[sn] = str(sv) + " "*(digits-len(str(sv)))
        for c in range(digits):
            if c == 0:
                colnms = "[Column Numbers \t"
            else:
                colnms = "[" + filler + "\t"
            nms = [str(n)[c] for n in srted[xy:xy+60]]
            colnms += "".join(nms)
            print (colnms + "]")
    print ("\n")
print ("\n")
print (";\nEND;\n\nBEGIN PAUP;")
if formSort:
    lower_bound = 1
    for c in range(1, binNo + 1):
        amt = binStr.count(str(c))
        if amt > 0:
            charset = "\tCharset Bin" + str(c) + " = "
            upper_bound = lower_bound + (binStr.count(str(c))) - 1
            charset = charset + str(lower_bound) + "-" + str(upper_bound) + ";"
            lower_bound = upper_bound + 1
            print (charset)
else:
    for x in range(1, binNo + 1):
        tmpL = []
        if str(x) in binStr:
            for y in range(len(binStr)):
                if str(binStr[y]) == str(x):
                    tmpL.append(str(y + 1))
            print ("\tCharset Bin" + str(x) + " = ", " ".join(tmpL) + ";")
if doPTP:
    print ("\tCharset Sig_Disagreement = " + " ".join([str(i) for i in sig_dis]) + ";")
print ("END;")
lt_F = time.localtime()
print ("[START TIME:", lt_S[3], ":", lt_S[4],":", lt_S[5], "]")
print ("[FINISH TIME:", lt_F[3], ":", lt_F[4],":", lt_F[5], "]")
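The script above reads its options with a hand-written loop over `sys.argv`. As a possible alternative, and not something this commit does, the same flags could be declared with the standard library's `argparse`. The sketch below is a rough illustration only: `build_parser`, `split_commas` and the `dest` names are hypothetical, while the flags and defaults are taken from the help text.

```python
import argparse


def split_commas(value):
    """Turn 's,r' or '?,*' into a list of items, as the script's split(',') does."""
    return value.split(",")


def build_parser():
    """Declare TIGER's documented options; destination names are illustrative."""
    parser = argparse.ArgumentParser(prog="tiger")
    parser.add_argument("-in", dest="infile", required=True,
                        help="aligned FastA input file")
    parser.add_argument("-v", action="version", version="TIGER version 1.02")
    parser.add_argument("-f", dest="fmt", type=split_commas, default=[],
                        help="output format flags, e.g. s,r")
    parser.add_argument("-b", dest="bins", type=int, default=10,
                        help="number of bins")
    parser.add_argument("-rl", dest="rate_file",
                        help="write per-site rates to this file")
    parser.add_argument("-ptp", action="store_true", help="run the PTP test")
    parser.add_argument("-z", dest="rands", type=int, default=100,
                        help="randomisations per site for the PTP test")
    parser.add_argument("-p", dest="pval", type=float, default=0.05,
                        help="significance threshold for the PTP test")
    parser.add_argument("-pl", dest="pval_file",
                        help="write per-site p-values to this file")
    parser.add_argument("-u", dest="unknown", type=split_commas, default=["?"],
                        help="comma-separated unknown characters")
    return parser


if __name__ == "__main__":
    demo = ["-in", "ExampleFile.aln", "-ptp", "-z", "1000", "-p", "0.01", "-u", "?,*"]
    print(build_parser().parse_args(demo))
```

One caveat with this approach: `argparse` may read a value that starts with `-` (for example `-u -,?`) as another option, so such values would likely need the attached form `-u=-,?`.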

229
tiger_fns_102.py Normal file

@ -0,0 +1,229 @@
#!/usr/bin/env python
import re
# DEFINE FUNCTIONS
def shared(x, y):
    sh = 0
    grX = x.split("|")
    grY = y.split("|")
    for i in range(len(grX)):
        grX[i] = set(grX[i].split(","))
    for j in range(len(grY)):
        grY[j] = set(grY[j].split(","))
    for xXx in grX:
        for yYy in grY:
            if xXx.issubset(yYy):
                sh += 1
                break
    return sh
def scoreConflict(indA, patA, patterns):
    vrai = True
    dividand = 0
    pg = patA.split("|")
    score = 0.0
    pats2 = patterns[:]
    overall_rank = 0.0
    conflict_score = 0.0
    del pats2[indA]
    for bb, patB in enumerate(pats2):
        # only variable patterns (more than one group) are compared
        if re.search(r"\|", patB):
            dividand += 1
            compB = patB.split("|")
            scB = 0.0
            for group in compB:
                cont = 1
                if vrai:
                    tax = group.split(",")
                    ch1 = tax[0]
                    for gp in pg:
                        if ch1 in gp:
                            for ch2 in tax[1:]:
                                if ch2 not in gp and cont:
                                    scB = scB + 1.0
                                    cont = 0
            score += (scB/len(compB))
    # avoid a ZeroDivisionError when there is no variable pattern to compare against
    if dividand == 0:
        return 0.0
    conflict_score = score/dividand
    return conflict_score
def uniqify(seq):
    seen = {}
    result = []
    for item in seq:
        if item in seen : continue
        seen[item] = 1
        result.append(item)
    return result
def getPattern(site, unknown):
    considered = []
    pattern = []
    for x in range(len(site)):
        if site[x] not in unknown:
            if site[x] in considered:
                pattern[considered.index(site[x])].append(str(x))
            else:
                considered.append(site[x])
                pattern.append([str(x)])
    patStr = "|".join([",".join(g) for g in pattern])
    return patStr
def jumblePattern(site, unknown):
    import random
    siteJ = ""
    while site:
        pos = random.randrange(len(site))
        siteJ += site[pos]
        site = site[:pos] + site[(pos+1):]
    return getPattern(siteJ, unknown)
def DNAdetect(seq):
    seq = seq.upper()
    oLen = float(len(seq))
    seq_C = ""
    seq_C = seq.replace("A", "")
    seq_C = seq_C.replace("C", "")
    seq_C = seq_C.replace("G", "")
    seq_C = seq_C.replace("T", "")
    nLen = float(len(seq_C))
    perc = (nLen/oLen)*100
    if perc < 20.0:
        return "DNA"
    else:
        seq_C = seq.replace("0", "")
        seq_C = seq_C.replace("1", "")
        if len(seq_C) == 0:
            return "standard"
        else:
            return "protein"
def histogram(num_list, name_list):
    upper = float(max(num_list))
    pad = len(str(upper))
    parts = []
    for i in range(1,61):
        parts.append((upper/60)*i)
    for m, n in enumerate(num_list):
        pr = name_list[m] + "|"
        low = 0.0
        if n == 0:
            pr = pr + " "*61
        for p, hi in enumerate(parts):
            if n > low and n <= hi:
                pr = pr + "="*(p+1) + (" "*(60 - p))
                break
            low = hi
        print ("[" + pr + "|" + str(n) + " "*(pad-len(str(n))) + "]")
def FastaParse(file_name):
    file = open(file_name)
    names = []
    seqs = []
    for line in file:
        if ">" in line:
            names.append(line[1:].strip())
            seqs.append("")
        else:
            seqs[-1] += line.strip()
    ret = [names, seqs]
    return ret
def printHelp():
    print ("""
****************
TIGER Help:
****************
TIGER: Tree-Independent Generation of Evolutionary Rates
(Developed by Carla Cummins in the lab of James McInerney, NUI Maynooth, Co. Kildare, Ireland)
-Options:
    -in     Specify input file. File must be in FastA format and must be aligned prior.
            Datasets with uneven sequence lengths will return an error.
    -v      Returns current TIGER version.
    -f      Changes output formatting options.
            -f s: sorts sites depending on their agreement score
            -f r: displays rank values rather than bin numbers
            -f s,r: displays sorted ranks (*Be sure to put only a "," NO SPACE!)
            Default prints bin numbers unsorted.
    -b      Set the number of bins to be used.
            -b <int>: Sites will be placed into <int> number of bins. <int> is a whole number.
            Default is 10
    -rl     A list of the rate at each site may be optionally written to a specified
            file.
            -rl <file.txt> : writes list of the rates at each site to file.txt.
    -ptp    Specifies that a PTP test should be run. *Note: this option has a huge
            effect on running time!
    -z      Number of randomisations to be used for the PTP test.
            -z <int>: each site will be randomised <int> times. <int> is a whole number.
            Default is 100
    -p      Specify p-value which denotes significance in PTP test.
            -p <float>: site will be denoted as significant if p-value is better than <float>.
            <float> is a floating point number.
            Default is 0.05
    -pl     Write a list of p-values to a specified file.
            -pl <file.txt>: writes list of p-values for each site to file.txt.
    -u      Specify unknown characters in the alignment. Unknown characters are omitted from
            site patterns and so are not considered in the analysis.
            -u ?,-,*: defines ?, - and * as unknown characters. (*Be sure to put only a comma
            between characters, NO SPACE!!)
            Default is ? only
-Examples:
    1. ./tiger -in ExampleFile.aln -f s,r -rl rate_list.txt
       This will run the software on "ExampleFile.aln", with sorted ranks included in the output.
       The variability measure for each site will be displayed and a list of the rates at (unsorted)
       sites will be written to the file "rate_list.txt".
    2. ./tiger -in ExampleFile.aln -ptp -z 1000 -p 0.01 -u ?,*
       This will run the software on the file "ExampleFile.aln" with a PTP test. Sites will be
       randomised 1,000 times and pass the test if their p-value is <0.01. All ? and * characters
       encountered in the alignment will be omitted from the analysis.
""")