fix: python3 migration bug

Update README.md
Delete sample_data directory
3 changed files with 674 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -1,15 +1,75 @@
 # tiger
 Identifying rapidly-evolving characters in evolutionary data

-## About Tiger
+## Usage
+```
+****************
+TIGER Help:
+****************

-TIGER is open source software for identifying rapidly evolving sites (columns in an alignment, or characters in a morphological dataset). It can deal with many kinds of data (molecular, morphological etc.). Sites like these are important to identify as they are very often removed or reweighted in order to improve phylogenetic reconstruction. When a site is changing very quickly between taxa it might not hold much phylogenetic information and therefore might simply be a source of noise. Use of TIGER can (a) allow you to see the amount of rapid evolution and noise in your alignment and (b) provide a quick and easy way to remove as many of the “noisy” sites as possible.
+TIGER: Tree-Independent Generation of Evolutionary Rates

-TIGER uses conflict between site patterns as a proxy for rapid evolution; that is, a site that does not conflict with other sites in the alignment is generally a very slowly evolving or constant site. A site with lots of conflict is considered rapidly evolving (Cummins & McInerney, Systematic Biology, 2011). TIGER rates the conflict and categorizes the sites based on the rates. In this software the categories are called bins and are user definable (see manual for further details). Bin1 will contain the constant sites and the bin with the highest number will contain the most rapidly evolving sites.
+(Developed by Carla Cummins in the lab of James McInerney, NUI Maynooth, Co. Kildare, Ireland)
+
+-Options:
+
+-in        Specify input file. File must be in FastA format and must be aligned prior.
+           Datasets with uneven sequence lengths will return an error.
+
+-v         Returns current TIGER version.
+
+-f         Changes output formatting options.
+           -f s: sorts sites depending on their agreement score
+           -f r: displays rank values rather than bin numbers
+           -f s,r: displays sorted ranks (*Be sure to put only a "," NO SPACE!)
+           
+           Default prints bin numbers unsorted.
+
+-b         Set the number of bins to be used.
+           -b <int>: Sites will be placed into <int> number of bins. <int> is a whole number.
+
+           Default is 10
+
+-rl       A list of the rate at each site may be optionally written to a specified
+          file. 
+          -rl <file.txt> : writes list of the rates at each site to file.txt.
+
+-ptp      Specifies that a PTP test should be run. *Note: this option has a huge 
+          effect on running time!
+
+-z        Number of randomisations to be used for the PTP test. 
+          -z <int>: each site will be randomised <int> times. <int> is a whole number.
+
+          Default is 100
+
+-p       Specify p-value which denotes significance in PTP test.
+         -p <float>: a site will be denoted as significant if its p-value is below <float>.
+                     <float> is a floating point number.
+
+         Default is 0.05
+
+-pl      Write a list of p-values to a specified file.
+         -pl <file.txt>: writes list of p-values for each site to file.txt.
+
+-u       Specify unknown characters in the alignment. Unknown characters are omitted from 
+         site patterns and so are not considered in the analysis.
+         -u ?,-,*: defines ?, - and * as unknown characters. (*Be sure to put only a comma
+                   between characters, NO SPACE!!)
+         
+         Default is ? only
+
+```

 ## System Requirements

-TIGER is implemented in Python, so it should run on most computer platforms.
+- Python 3.x.

-For UNIX machines, a working version of Python 2.5 or 2.6 is required. On Windows machines, TIGER comes with everything it needs and requires no further installations.
+## Note

+Version 1.04 was prepared by Guoyi Zhang rather than the original authors, so please cite this repository when you use it.
+
+## Citation
+
+- Cummins, C.A. and McInerney, J.O. (2011) A method for inferring the rate of evolution of homologous characters that can potentially improve phylogenetic inference, resolve deep divergence and correct systematic biases. Systematic Biology 60 (6) 833-844. doi: 10.1093/sysbio/syr064. 
+
+- Zhang G. (2022) TIGER version 1.04. https://github.com/starsareintherose/tiger
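
The `-u` and `-f` options in the help text above take comma-separated values with no spaces. A minimal sketch of why the space matters, assuming the value is split on commas as the script in this commit does; `parse_list_option` is a hypothetical helper for illustration, not part of TIGER:

```python
def parse_list_option(argv, flag, default):
    """Return the comma-separated values that follow `flag`, or `default`."""
    if flag in argv:
        pos = argv.index(flag)
        if pos + 1 < len(argv):
            return argv[pos + 1].split(",")
    return list(default)


# Correct form: one argument, commas only.
print(parse_list_option(["tiger", "-u", "?,-,*"], "-u", ["?"]))

# With a space after the comma the shell passes "?," and "-" as two
# separate arguments, so "-" is never registered as an unknown character.
print(parse_list_option(["tiger", "-u", "?,", "-"], "-u", ["?"]))
```

In the second call the result contains an empty string instead of `-`, which is the failure the "NO SPACE" warning is guarding against.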
--- a/380
+++ b/380
@@ -0,0 +1,380 @@
+#!/usr/bin/env python
+
+#*********TIGER v1.02*************
+
+#...read in options...
+import time
+lt_S = time.localtime()
+import sys
+import random
+import re
+from tiger_fns_102 import *
+options = sys.argv
+
+formRate = 0
+formSort = 0
+binNo = 10
+file = ""
+rate_file = ""
+write_rates = False
+numbered = False
+
+doPTP = False
+rands = 100
+pval = 0.05
+
+write_pvals = False
+pval_file = ""
+
+unknown = ["?"]
+
+if len(options) < 2:
+    printHelp()
+    sys.exit(0)
+
+for opt in range(len(options)):
+    if options[opt] == "-in":
+        file_name = options[opt + 1]
+        try:
+            file = open(file_name)
+        except IOError:
+            print ("File \"" + file_name + "\" not found...")
+            sys.exit(0)
+
+    elif options[opt] == "-b":
+        binNo = int(options[opt + 1])
+
+    elif options[opt] == "-f":
+        formOpts = options[opt + 1].split(",")
+        if "r" in formOpts:
+            formRate = 1
+        if "s" in formOpts:
+            formSort = 1
+        if "c" in formOpts:
+            numbered = True
+
+    elif options[opt] == "-v":
+        print ("TIGER version 1.02")
+        sys.exit(0)
+
+    elif options[opt] == "-rl":
+        write_rates = True
+        try:
+            rate_file = options[opt+1]
+            if rate_file[0] == "-":
+                print ("\n\nPlease specify file name for -rl option.\n\n")
+                sys.exit(0)
+        except IndexError:
+            print ("\n\nPlease specify file name for -rl option.\n\n")
+            sys.exit(0)
+
+    elif options[opt] == "-pl":
+        write_pvals = True
+        try:
+            pval_file = options[opt+1]
+            if pval_file[0] == "-":
+                print ("\n\nPlease specify file name for -pl option.\n\n")
+                sys.exit(0)
+        except IndexError:
+            print ("\n\nPlease specify file name for -pl option.\n\n")
+            sys.exit(0)
+
+    elif options[opt] == "-ptp":
+        doPTP = True
+
+    elif options[opt] == "-z":
+        rands = int(options[opt+1])
+
+    elif options[opt] == "-p":
+        pval = float(options[opt+1])
+
+    elif options[opt] == "-u":
+        unknown = options[opt+1].split(",")
+
+
+
+if not file:
+    print ("No file specified (-in option)")
+    sys.exit(0)
+
+
+#Parse .aln file
+names = []
+seqs = []
+if ">" in file.readline():
+    tmp = FastaParse(file_name)
+else:
+    print ("""
+    *******************************
+     File not in correct format!  
+     TIGER accepts FastA format.
+    *******************************
+    """)
+    sys.exit(0)
+
+import re
+names = tmp[0]
+for x, nm in (enumerate(names)):
+    names[x] = re.sub(" ", "", nm)
+seqs = tmp[1]
+
+lns = [len(l) for l in seqs]
+lns.sort()
+if lns[0] != lns[-1]:
+    print ("\n\nUneven sequence lengths. Ensure sequences have been aligned!\n\n")
+    sys.exit(0)
+
+
+datatype = DNAdetect(seqs[0]) 
+
+#Create array of site patterns
+patterns = []
+sites = []
+comp_Pats = []
+comp_Sites = []
+for i in range(len(seqs[0])):
+    sites.append("".join([j[i] for j in seqs]))
+    patterns.append(getPattern(sites[-1], unknown))
+    if patterns[-1] not in comp_Pats:
+        comp_Pats.append(patterns[-1])
+        comp_Sites.append(sites[-1])
+
+
+#Compare character patterns and score
+ranks = list(range(len(patterns))) #ranks = range(len(patterns))
+comp_ranks = []
+keep = []
+br = ""
+if doPTP:
+    comp_disagree = []
+    disagree = list(range(len(patterns))) #disagree = range(len(patterns))
+
+for aa, patA in enumerate(comp_Pats):
+    if re.search(r"\|", patA):
+        cfl = scoreConflict(aa, patA, patterns)
+        comp_ranks.append(1.0 - cfl)
+    else:
+        comp_ranks.append(1.0)
+
+    if doPTP:
+        dis = 0
+        for z in range(rands):
+            patJ = jumblePattern(comp_Sites[aa], unknown)
+            scr = scoreConflict(aa, patJ, patterns)
+            #check if site has higher conflict score than original site
+            if scr >= (1.0 - comp_ranks[aa]):
+                dis += 1
+
+        dis = float(dis)/float(rands)
+        comp_disagree.append(dis)
+
+    
+    i = -1
+    try:
+        while 1:
+            i = patterns.index(patA, i + 1)
+            ranks[i] = comp_ranks[aa]
+            if doPTP:
+                disagree[i] = comp_disagree[aa]
+    except ValueError:
+        pass
+
+#Output total alignment score to file alnRate.txt
+if write_rates:
+    handle = open(rate_file, 'w')
+    handle.write("\n".join([str(r) for r in ranks]))
+    handle.close()
+
+if write_pvals and not doPTP:
+    sys.stderr.write("-ptp option not selected. No p-values to write!\n")
+
+
+#Find significant sites
+if doPTP:
+    sig_dis = []
+    for x in range(len(disagree)):
+        if disagree[x] < pval:
+            sig_dis.append(x+1)
+
+    if write_pvals:
+        outF = open(pval_file, 'w')
+        for d in range(len(disagree)):
+            outF.write(str(disagree[d]))
+            if (d+1) in sig_dis:
+                outF.write(" *")
+            outF.write("\n")
+
+
+#BINNING
+maxRank = max(ranks)
+minRank = min(ranks)
+
+binStr = []
+for numb, r in enumerate(ranks):
+    if 1 > 0:
+        binParts = []
+        for mult in range(binNo + 1):
+            binParts.append(minRank + (((maxRank - minRank)/binNo)*mult))
+        binParts.reverse()
+        for bin in range(1, len(binParts)):
+            if r < binParts[bin - 1] and r >= binParts[bin]:
+                binStr.append(str(bin))
+                break
+            elif round(r, 5) == round(maxRank, 5):
+                binStr.append("1")
+                break
+
+
+#Make names equal lengths
+filler = " "*20
+filled_names = names[:]
+for n, nm in enumerate(names):
+    nm_r = re.sub(r"[\s]+", "_", nm)
+    if len(nm_r) > 20:
+        filled_names[n] = nm_r[1] + "_" + nm_r[-18:]
+    else:
+        filled_names[n] = nm_r+ filler[:(20 -len(nm_r))]
+
+
+#print in correct format.... (
+print ("#NEXUS\n\n[This file contains data that has been analysed for site specific rates]")
+print ("[using TIGER, developed by Carla Cummins in the laboratory of]")
+print ("[Dr James McInerney, National University of Ireland, Maynooth]\n\n")
+
+print ("[Histograms of number of sites in each category:]")       
+Hnames = []
+counts = []
+for b in range(binNo):
+    Hnames.append("Bin" + str(b+1) + "\t")
+    counts.append(binStr.count(str(b+1)))
+histogram(counts, Hnames)
+
+print ("\n\n")
+
+print ("\n\n\nBEGIN TAXA;")
+print ("\tDimensions NTax = ", len(seqs), ";")
+print ("\tTaxLabels ", " ".join(filled_names), ";\nEND;\n")
+
+print ("BEGIN CHARACTERS;")
+print ("\tDimensions nchar = ", len(seqs[0]), ";")
+print ("\tFormat datatype = ", datatype, " gap = - interleave;\nMatrix\n")
+
+
+sorted = list(range(len(ranks))) #sorted = range(len(seqs[0]))
+if formSort == 1:
+    sorted = list(range(len(ranks)))
+    sr = ranks[:]
+    sr.sort()
+    sr.reverse()
+    s = 0
+    sortD = {}
+    for x in range(len(ranks)):
+        ind = sr.index(ranks[x])
+        sortD[x] = ind
+        sorted[ind] = x
+        sr[ind] = "|"
+
+    print (sortD)
+
+
+    if doPTP:
+        sig_sorted = []
+        for d in sig_dis:
+            sig_sorted.append(sortD[d-1]+1)
+        sig_dis = sig_sorted
+
+for xy in range(0, len(seqs[0]), 60):
+    for xz in range(len(seqs)):
+        ln = filled_names[xz] + "\t"
+        
+        if len(seqs[xz][xy:]) < 60:
+            for i in sorted[xy:]:
+                ln = ln + seqs[xz][i].upper()
+        else:
+            for j in sorted[xy:xy+60]:
+                ln = ln + seqs[xz][j].upper()
+
+        print (ln)
+    
+    if formRate == 0:
+        for x in range(len(str(binNo))):
+            if x == 0:
+                bnls = "[Bin Numbers        \t"
+            else:
+                bnls = "[" + filler + "\t"
+                
+            for y in range(xy, (xy + 60)):
+                if y < len(binStr):
+                    if len(binStr[sorted[y]]) - 1 < x:
+                        bnls = bnls + " "
+                    else:
+                        s = sorted[y]
+                        bnls = bnls + str(binStr[s])[x]
+                else:
+                    break
+            print (bnls + "]")
+
+    else:
+        for c in range(5):
+            if c == 0:
+                rtls = "[Rank Values        \t"
+            else:
+                rtls = "[" + filler + "\t"
+            for d in range(xy, (xy + 60)):
+                if d < len(ranks):
+                    if len(str(ranks[sorted[d]])) - 1 < c:
+                        rtls = rtls + " "
+                    else:
+                        rtls = rtls + str(ranks[sorted[d]])[c]
+                else:
+                    break
+            print (rtls + "]")
+
+    if numbered:
+        digits = len(str(len(sorted)))
+        srted = sorted[:]
+        for sn, sv in enumerate(srted):
+            if len(str(sv)) < digits:
+                srted[sn] = str(sv) + " "*(digits-len(str(sv)))
+        for c in range(digits):
+            if c == 0:
+                colnms = "[Column Numbers     \t"
+            else:
+                colnms = "[" + filler + "\t"
+            nms = [str(n)[c] for n in srted[xy:xy+60]]
+            colnms += "".join(nms)
+            print (colnms + "]")
+
+
+
+    print ("\n")
+print ("\n")
+
+print (";\nEND;\n\nBEGIN PAUP;")
+if formSort:
+    lower_bound = 1
+    for c in range(1, binNo + 1):
+        amt = binStr.count(str(c))
+        if amt > 0:
+            charset =  "\tCharset Bin" + str(c) + " = "
+            upper_bound = lower_bound + (binStr.count(str(c))) - 1
+            charset = charset + str(lower_bound) + "-" + str(upper_bound) + ";"
+            lower_bound = upper_bound + 1
+            print (charset)
+else:
+    for x in range(1, binNo + 1):
+        tmpL = []
+        if str(x) in binStr:
+            for y in range(len(binStr)):
+                if str(binStr[y]) == str(x):
+                    tmpL.append(str(y + 1))
+            print ("\tCharset Bin" + str(x) + " = ", " ".join(tmpL) + ";")
+
+if doPTP:
+    print ("\tCharset Sig_Disagreement = " + " ".join([str(i) for i in sig_dis]) + ";")
+
+print ("END;")
+
+lt_F = time.localtime()
+print ("[START TIME:", lt_S[3], ":", lt_S[4],":", lt_S[5], "]")
+print ("[FINISH TIME:", lt_F[3], ":", lt_F[4],":", lt_F[5], "]")
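
The binning step in the script above can be read as dividing the span between the lowest and highest rank into `binNo` equal-width intervals, with bin 1 holding the highest-ranked sites. The sketch below is a simplified restatement under that reading, not the script's own code: `assign_bins` is a hypothetical name, and floating-point edge cases may land in a neighbouring bin compared with the original loop.

```python
import math


def assign_bins(ranks, bin_no=10):
    """Return a 1-based bin number per rank; bin 1 holds the highest ranks."""
    lo, hi = min(ranks), max(ranks)
    if hi == lo:
        # Every site has the same rank, so everything falls into bin 1.
        return [1] * len(ranks)

    width = (hi - lo) / bin_no
    bins = []
    for r in ranks:
        # Number of bin widths below the top rank, rounded up.
        steps = math.ceil((hi - r) / width)
        # The top rank itself gives 0 steps; clamp it into bin 1.
        bins.append(min(bin_no, max(1, steps)))
    return bins


print(assign_bins([1.0, 0.75, 0.5, 0.25, 0.0, 0.9], bin_no=4))
```

A closed-form bin index avoids rebuilding the list of bin boundaries for every site, which the loop in the script does on each iteration.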
--- a/tiger_fns_102.py
+++ b/tiger_fns_102.py
@@ -0,0 +1,229 @@
+#!/usr/bin/env python
+
+import re
+
+# DEFINE FUNCTIONS
+
+def shared(x, y):
+	sh = 0
+	grX = x.split("|")
+	grY = y.split("|")
+	
+	for i in range(len(grX)):
+		grX[i] = set(grX[i].split(","))
+	for j in range(len(grY)):
+		grY[j] = set(grY[j].split(","))
+		
+		
+	for xXx in grX:
+		for yYy in grY:
+			if xXx.issubset(yYy):
+				sh += 1
+				break
+				
+	return sh
+
+def scoreConflict(indA, patA, patterns):
+    vrai = True
+
+    dividand = 0
+
+    pg = patA.split("|")
+    score = 0.0
+    pats2 = patterns[:]
+    overall_rank = 0.0
+    conflict_score = 0.0
+    del pats2[indA]
+    for bb, patB in enumerate(pats2):
+        if re.search(r"\|", patB):
+            dividand += 1
+            compB = patB.split("|")
+            scB = 0.0
+            for group in compB:
+                cont = 1
+                if vrai:
+                    tax = group.split(",")
+                    ch1 = tax[0]
+                    for gp in pg:
+                        if ch1 in gp:
+                            for ch2 in tax[1:]:
+                                if ch2 not in gp and cont:
+                                    scB = scB + 1.0
+                                    cont = 0
+            score += (scB/len(compB))
+
+    conflict_score = score/dividand
+	
+    return conflict_score
+    
+def uniqify(seq):
+    seen = {}
+    result = []
+    for item in seq:
+        if item in seen : continue
+        seen[item] = 1
+        result.append(item)
+    return result
+
+
+def getPattern(site, unknown):
+    considered = []
+    pattern = []
+    for x in range(len(site)):
+        if site[x] not in unknown:
+            if site[x] in considered:
+                pattern[considered.index(site[x])].append(str(x))
+            else:
+                considered.append(site[x])
+                pattern.append([str(x)])
+
+
+    patStr = "|".join([",".join(g) for g in pattern])
+
+    return patStr
+
+
+def jumblePattern(site, unknown):
+	import random
+
+	siteJ = ""
+	while site:
+		pos = random.randrange(len(site))
+		siteJ += site[pos]
+		site = site[:pos] + site[(pos+1):]
+
+	return getPattern(siteJ, unknown)
+
+
+def DNAdetect(seq):
+    seq = seq.upper()
+    oLen = float(len(seq))
+    seq_C = ""
+    
+    seq_C = seq.replace("A", "")
+    seq_C = seq_C.replace("C", "")
+    seq_C = seq_C.replace("G", "")
+    seq_C = seq_C.replace("T", "")
+    
+    nLen = float(len(seq_C))
+    perc = (nLen/oLen)*100
+
+    if perc < 20.0:
+        return "DNA"
+    else:
+        seq_C = seq.replace("0", "")
+        seq_C = seq_C.replace("1", "")
+        if len(seq_C) == 0:
+            return "standard"
+        else:
+            return "protein"
+        
+
+def histogram(num_list, name_list):
+    upper = float(max(num_list))
+    pad = len(str(upper))
+    parts = []
+    for i in range(1,61):
+        parts.append((upper/60)*i)
+
+    for m, n in enumerate(num_list):
+        pr = name_list[m] + "|"
+        low = 0.0
+        if n == 0:
+            pr = pr + " "*61
+        for p, hi  in enumerate(parts):
+            if n > low and n <= hi:
+                pr = pr + "="*(p+1) + (" "*(60 - p))
+                break
+            low = hi
+        print ("[" + pr + "|" + str(n) + " "*(pad-len(str(n))) + "]")
+
+
+def FastaParse(file_name):
+    file = open(file_name)
+    names = []
+    seqs = []
+    for line in file:
+        if ">" in line:
+            names.append(line[1:].strip())
+            seqs.append("")
+        else:
+            seqs[-1] += line.strip()
+
+    ret = [names, seqs]
+    return ret
+
+
+def printHelp():
+    print ("""
+****************
+TIGER Help:
+****************
+
+TIGER: Tree-Independent Generation of Evolutionary Rates
+
+(Developed by Carla Cummins in the lab of James McInerney, NUI Maynooth, Co. Kildare, Ireland)
+
+-Options:
+
+-in        Specify input file. File must be in FastA format and must be aligned prior.
+           Datasets with uneven sequence lengths will return an error.
+
+-v         Returns current TIGER version.
+
+-f         Changes output formatting options.
+           -f s: sorts sites depending on their agreement score
+           -f r: displays rank values rather than bin numbers
+           -f s,r: displays sorted ranks (*Be sure to put only a "," NO SPACE!)
+           
+           Default prints bin numbers unsorted.
+
+-b         Set the number of bins to be used.
+           -b <int>: Sites will be placed into <int> number of bins. <int> is a whole number.
+
+           Default is 10
+
+-rl       A list of the rate at each site may be optionally written to a specified
+          file. 
+          -rl <file.txt> : writes list of the rates at each site to file.txt.
+
+-ptp      Specifies that a PTP test should be run. *Note: this option has a huge 
+          effect on running time!
+
+-z        Number of randomisations to be used for the PTP test. 
+          -z <int>: each site will be randomised <int> times. <int> is a whole number.
+
+          Default is 100
+
+-p       Specify p-value which denotes significance in PTP test.
+         -p <float>: a site will be denoted as significant if its p-value is below <float>.
+                     <float> is a floating point number.
+
+         Default is 0.05
+
+-pl      Write a list of p-values to a specified file.
+         -pl <file.txt>: writes list of p-values for each site to file.txt.
+
+-u       Specify unknown characters in the alignment. Unknown characters are omitted from 
+         site patterns and so are not considered in the analysis.
+         -u ?,-,*: defines ?, - and * as unknown characters. (*Be sure to put only a comma
+                   between characters, NO SPACE!!)
+         
+         Default is ? only
+
+
+-Examples:
+     
+1.   ./TIGER -in ExampleFile.aln -f s,r -rl rate_list.txt
+
+     This will run the software on "ExampleFile.aln", with sorted ranks included in the output.
+     The variability measure for each site will be displayed and a list of the rates at (unsorted)
+     sites will be written to the file "rate_list.txt".
+  
+2.   ./TIGER -in ExampleFile.aln -ptp -z 1000 -p 0.01 -u ?,*
+
+     This will run the software on the file "ExampleFile.aln" with a PTP test. Sites will be
+     randomised 1,000 times and pass the test if their p-value is <0.01. All ? and * characters
+     encountered in the alignment will be omitted from the analysis.
+   
+     """)
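
According to the help text, unknown characters are omitted from site patterns. A minimal sketch of the pattern string that `getPattern` appears intended to build: taxon indices that share a character state are joined with commas, and the groups are joined with `|`. The name `site_pattern` and the dict-based grouping are illustrative choices, not the repository's code.

```python
def site_pattern(site, unknown=("?",)):
    """Group taxon indices by character state, skipping unknown characters."""
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for idx, state in enumerate(site):
        if state in unknown:
            continue
        groups.setdefault(state, []).append(str(idx))
    return "|".join(",".join(members) for members in groups.values())


# One alignment column across six taxa; taxon 4 has an unknown state.
print(site_pattern("AAGG?A"))
# A constant column yields a single group and therefore no "|".
print(site_pattern("AAAA"))
```

A pattern without `|` is what the main script treats as a constant site and assigns a rank of 1.0.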
Author	SHA1	Message	Date
kuoi	4037b84781	fix: python3 migration bug	2024-01-15 13:17:50 +08:00
kuoi	d21c1ca08d	Update README.md	2022-03-05 14:47:34 +00:00
kuoi	bc33912068	Delete sample_data directory	2022-03-05 14:42:46 +00:00
kuoi	74ad0ea3eb	Delete README	2022-03-05 14:42:35 +00:00
kuoi	049b611311	Delete TIGER_Manual.pdf	2022-03-05 14:42:17 +00:00
kuoi	0201b8df0f	Delete tiger_fns_102.pyc	2022-03-05 14:42:09 +00:00
kuoi	05484d37f8	update the README about the requirements	2022-02-09 00:25:38 +00:00
kuoi	5aa4ff3e6b	fix the python2to3 bugs	2022-02-09 00:19:14 +00:00
kuoi	b847b1f634	update to support python3	2022-02-05 20:34:05 +00:00
kuoi	8a3c5bab7e	add source	2022-01-30 01:07:09 +00:00