Reference Genome based Exon Phylogeny Pipeline

Guoyi Zhang c91b0f8fb2 polish: clean unused path		2024-12-09 16:24:56 +11:00
CMakeLists.txt	fix: build system	2024-07-05 19:29:53 +10:00
config.example	polish: use buildPath instead of /; add: codon only	2024-12-09 15:23:55 +11:00
countTaxa.d	add: count fasta taxa from one folder	2024-09-15 16:12:47 +10:00
delstop.d	fix: delstop arg check	2024-09-10 02:25:16 +10:00
dub.sdl	add: dub.sdl multiple configuration	2024-09-18 17:06:34 +10:00
LICENSE.md	add: license	2024-07-05 15:51:42 +10:00
README.md	add: more details on cpp binary	2024-07-05 17:17:31 +10:00
RGBEPP.d	polish: clean unused path	2024-12-09 16:24:56 +11:00
sortdiamond.cpp	update: add more info	2024-07-05 19:35:14 +10:00
splitfasta.cpp	polish: splitfasta	2024-09-18 17:02:39 +10:00
splitfasta.d	add: splitfasta d version	2024-12-09 12:09:48 +11:00

README.md

RGB EPP

Reference Genome based Exon Phylogeny Pipeline

License: GPL-2.0-only

Author: Guoyi Zhang

Requirements

External software

GNU Bash (provides cd)
GNU coreutils (provides cp, mkdir, mv)
GNU findutils (provides find)
fastp
spades.py (provided by spades)
diamond
java
macse (default recognized path: /usr/share/java/macse.jar)
GNU parallel

Internal software

splitfasta (default recognized path: /usr/bin/splitfasta)
sortdiamond (default recognized path: /usr/bin/sortdiamond)
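Before a run, the requirements above can be checked with a short script. This is only a sketch: the command names are taken from the lists above, the helper names missing_cmds and missing_files are made up for illustration, and the three paths are the documented defaults, which may differ on your system.

```shell
# Commands named under "External software".
required_cmds="bash cp mv mkdir find fastp spades.py diamond java parallel"

# Print each command that cannot be found on PATH, one per line.
missing_cmds() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Print each path that does not exist, one per line.
missing_files() {
    local path
    for path in "$@"; do
        [ -e "$path" ] || echo "$path"
    done
}

# Report anything that is missing; no output means everything was found.
missing_cmds $required_cmds
missing_files /usr/share/java/macse.jar /usr/bin/splitfasta /usr/bin/sortdiamond
```

If macse, splitfasta or sortdiamond live elsewhere, pass their locations with --macse, --splitfasta and --sortdiamond instead.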

Arguments

Details

-c	--contigs	contigs type: scaffolds or contigs
-g	--genes		gene file path
-f	--functions	functions type (optional): all clean 
	  		assembly fasta map pre split merge align
-h	--help		show this information
-l	--list		list file path
-m	--memory	memory settings (optional, default 16 GB)
-r	--reference	reference genome path
-t	--threads	threads setting (optional, default 8 threads)
	--macse		Macse jarfile path
	--sortdiamond	sortdiamond file path
	--splitfasta	splitfasta file path
for example: bash RGBEPP.sh -c scaffolds -f all -l list -g genes -r reference.aa.fasta

Directories Design

.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_assemblied
├── 04_diamond
├── 05_pre
├── 06_split
├── 07_merge
├── 08_macse
├── genes
├── list
├── reference.aa.fasta
└── RGBEPP.sh

Each numbered directory from 01 to 08 corresponds to one function.

00_raw should contain all raw fastq.gz data.

Text Files

list is the text file containing all sample names. If your raw data follow the naming style ${list_name}_R1.fastq.gz and ${list_name}_R2.fastq.gz, then ${list_name} is what you should put in the list file. An easy way to generate it on a Linux/Unix system is the following command:

cd 00_raw
ls | sed "s@_R[12]\.fastq\.gz@@g" | sort -u > ../list
cd ..
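Since every sample is expected to have both an R1 and an R2 file, it can help to verify the list against 00_raw before starting. The following is a minimal sketch; the function name check_pairs and the demo file names are hypothetical and not part of RGBEPP.

```shell
# Print a line for every sample in the list file that lacks an R1 or R2 file.
check_pairs() {
    local list_file=$1 raw_dir=$2 name mate
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip empty lines
        for mate in R1 R2; do
            [ -e "$raw_dir/${name}_${mate}.fastq.gz" ] ||
                echo "missing: ${name}_${mate}.fastq.gz"
        done
    done < "$list_file"
}

# Demo in a throwaway directory: sample A is complete, sample B lacks R2.
demo=$(mktemp -d)
mkdir "$demo/00_raw"
touch "$demo/00_raw/A_R1.fastq.gz" "$demo/00_raw/A_R2.fastq.gz" "$demo/00_raw/B_R1.fastq.gz"
printf 'A\nB\n' > "$demo/list"
check_pairs "$demo/list" "$demo/00_raw"   # prints "missing: B_R2.fastq.gz"
rm -r "$demo"
```

In a real project directory the call would be check_pairs list 00_raw.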

genes is the text file containing all gene names from the reference fasta file. An easy way to generate it on a Linux/Unix system is the following command:

grep '>' Reference.fasta | sed "s@>@@g" > genes
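Because the gene names come straight from the FASTA headers, a repeated header would end up twice in genes. Whether RGBEPP tolerates duplicates is not stated here, so the check below is only a precaution; duplicate_genes is a hypothetical helper name.

```shell
# Print header names that occur more than once in a FASTA file.
duplicate_genes() {
    grep '>' "$1" | sed 's@>@@g' | sort | uniq -d
}

# Demo on a small throwaway FASTA file in which geneA appears twice.
demo_fasta=$(mktemp)
printf '>geneA\nMKV\n>geneB\nMLL\n>geneA\nMKI\n' > "$demo_fasta"
duplicate_genes "$demo_fasta"   # prints "geneA"
rm "$demo_fasta"
```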

reference.aa.fasta can be given any other name, but it must contain the reference genome as amino acid sequences in fasta format.

Process

RGBEPP.sh functions

Function clean: Quality control + trimming (fastp)
Function assembly: de novo assembly (spades)
Function fasta: gather all fasta files from assembly directories (RGBEPP.sh)
Function map: local alignment search of nucleic acid query sequences against amino acid subject sequences (diamond)
Function pre: generate corresponding sequences based on blast-styled output (sortdiamond)
Function split: splitting fasta sequence to directories based on the reference genome (splitfasta)
Function merge: merge the sequences of different taxa for the same reference exon into one fasta (RGBEPP.sh)
Function align: codon-based multiple sequence alignment (macse)
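The functions above can also be requested one at a time through -f. The sketch below only prints the command for each stage in the listed order. It assumes, without confirmation from the source, that each stage accepts the same remaining arguments as the "all" example; run_stage is a hypothetical helper.

```shell
# Stage names as accepted by -f, in pipeline order.
stages="clean assembly fasta map pre split merge align"

# Print the command for one stage; drop the leading `echo` to really run it.
run_stage() {
    echo bash RGBEPP.sh -c scaffolds -f "$1" -l list -g genes -r reference.aa.fasta
}

for stage in $stages; do
    run_stage "$stage"
done
```

Running stage by stage makes it possible to inspect each numbered directory before moving on.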

Downstream process

concatenate sequences via SeqCombGo or catsequences or sequencematrix
coalescent / concatenated phylogeny

sortdiamond

Usage: sortdiamond diamond_output.m8 generated.fasta sseq,qstart,qend,bitscore/evalue,qseq(optional, default 1,6,7,11,17, start from 0) bitscore/evalue(optional, default bitscore)

Column indices are 0-based: the default sseq index 1 is column 2, the default qstart index 6 is column 7, and so on.

Diamond's default output format (--outfmt 6) does not contain qseq, so you must customize the fields of output format 6 to include it.
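To work out the column argument for a custom field order, the 0-based positions can be computed rather than counted by hand. The field order below is a hypothetical example: the twelve standard --outfmt 6 fields, five padding fields chosen only for illustration, then qseq. It is not confirmed to be the order RGBEPP uses. The sketch also assumes that "sseq" in the usage line refers to the subject ID field (sseqid).

```shell
# Hypothetical field order passed after `--outfmt 6` (an assumption, see above).
fields="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen nident positive gaps qseq"

# Print the 0-based position of one field name within $fields.
field_index() {
    local wanted=$1 i=0 name
    for name in $fields; do
        if [ "$name" = "$wanted" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Build the comma-separated argument in the order sseq,qstart,qend,bitscore,qseq.
columns_arg() {
    local out="" name
    for name in sseqid qstart qend bitscore qseq; do
        out="${out:+$out,}$(field_index "$name")"
    done
    echo "$out"
}

columns_arg   # prints "1,6,7,11,17" for the field order above
```

With this particular order the result equals the documented default, so the third argument could be omitted; any other order needs the printed value passed explicitly.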

splitfasta

Usage: splitfasta sample.fasta

It always creates its directories in the directory from which splitfasta is run, and puts the split fasta files into them.