# RGB EPP

Reference Genome based Exon Phylogeny Pipeline

License: GPL-2.0-only

Author: Guoyi Zhang

## Requirements

### External software

- fastp
- spades.py (provided by spades)
- diamond
- bowtie2
- samtools
- bcftools
- exonerate (optional, only for --codon)
- java
- macse (default recognized path: /usr/share/java/macse.jar)
- trimal

### Internal software

- sortdiamond (default recognized path: /usr/bin/sortdiamond)
- delstop (default recognized path: /usr/bin/delstop)

## Arguments

### Details

```
  -c    --config        config file for software path (optional)
  -g    --genes         gene file path (optional, if -r is specified)
  -f    --functions     functions type (optional): all clean assembly
                        map postmap varcall consen codon align trim
  -h    --help          show this information
  -l    --list          list file path
  -m    --memory        memory settings (optional, default 16 GB)
  -r    --reference     reference genome path
  -t    --threads       threads setting (optional, default 8 threads)
        --codon         Only use the codon region (optional)
        --fastp         Fastp path (optional)
        --spades        Spades python path (optional)
        --diamond       Diamond python path (optional)
        --sortdiamond   SortDiamond python path (optional)
        --bowtie2       Bowtie2 path (optional)
        --samtools      Samtools path (optional)
        --bcftools      Bcftools path (optional)
        --exonerate     Exonerate path (optional)
        --macse         Macse jarfile path (optional)
        --delstop       Delstop path (optional)
        --trimal        Trimal path (optional)

for example: ./RGBEPP -f all -l list -t 8 -r reference.fasta
```

### Directories Design

```
.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_bowtie2
├── 04_bam
├── 05_vcf
├── 06_consen
├── 07_macse
├── 08_trimal
├── list
├── gene
├── reference.aa.fasta
└── RGBEPP
```

Each directory corresponds to a function. `00_raw` should contain all raw fastq.gz data.

### Text Files

`list` is a text file containing all sample names. If your raw data files are named `${list_name}_R1.fastq.gz` and `${list_name}_R2.fastq.gz`, then `${list_name}` is what you should put in the `list` file.
An easy way to generate the `list` file on a Linux/Unix system is the following command:

```
cd 00_raw
# strip the _R1/_R2 suffix, then drop the duplicate name left by each read pair
ls | sed "s@_R[12]\.fastq\.gz@@" | sort -u > ../list
cd ..
```

`genes` is a text file containing all gene names from the reference fasta file. An easy way to generate it on a Linux/Unix system is the following command:

```
grep '>' Reference.fasta | sed "s@>@@g" > genes
```

`reference.aa.fasta` can be replaced by any other name, but the file must contain the reference amino acid sequences in fasta format.

## Process

### RGBEPP functions

- Function clean: quality control + trimming (fastp)
- Function assembly: de novo assembly (spades)
- Function map: local alignment search of nucleic acid sequences against amino acid subject sequences (diamond, sortdiamond), then mapping of raw reads to their scaffold sequences (bowtie2)
- Function postmap: sorting and marking the read alignments (samtools)
- Function varcall: variant calling and filtering (bcftools)
- Function consen: get consensus fasta files from vcf files (bcftools), then sort sequences by gene name and taxon name (RGBEPP)
- Function codon (optional): extract only the exon sequences (exonerate)
- Function align: codon-based multiple sequence alignment (macse)
- Function trim: codon-based trimming (trimal, delstop)

### Argument requirements for functions

| Functions | -g/--genes | -l/--list | -r/--reference |
| --------- | ---------- | --------- | -------------- |
| clean     |            | ✔         |                |
| assembly  |            | ✔         |                |
| map       |            | ✔         | ✔              |
| postmap   |            | ✔         |                |
| varcall   |            | ✔         |                |
| consen    | ✔          | ✔         |                |
| codon     | ✔          |           | ✔              |
| align     | ✔          |           |                |
| trim      | ✔          |           |                |

### Downstream process

- concatenate sequences via SeqCombGo or catsequences or sequencematrix
- coalescent / concatenated phylogeny

## Inner software

### sortdiamond

Usage: `sortdiamond diamond_output.m8 generated.fasta [sseq,qstart,qend,bitscore/evalue,qseq] [bitscore/evalue]`

- third argument (optional): column indices of sseq, qstart, qend, bitscore/evalue and qseq, counted from 0; default `1,6,7,11,17`
- fourth argument (optional): selects `bitscore` or `evalue`; default `bitscore`

Because indices start from 0, the default sseq (index 1) is the second column, qstart (index 6) is the seventh column, and so on.
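
The default indices can be checked against a concrete field list. The sketch below assumes a hypothetical 18-field DIAMOND tabular layout (the 12 standard fields followed by six extra ones ending in `qseq`); this field choice is only an illustration and is not necessarily the layout RGBEPP itself passes to DIAMOND. The file names in the printed command are placeholders. The script prints the 0-based index of each field at the default sortdiamond positions and does not run DIAMOND.

```shell
#!/bin/sh
# Hypothetical 18-field layout: the 12 standard DIAMOND tabular fields
# plus six extra fields, chosen so that qseq lands on 0-based index 17.
fields="qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen gaps ppos qframe qseq"

# Print "index name" for the fields at the default sortdiamond indices.
i=0
for name in $fields; do
    case "$i" in
        1|6|7|11|17) echo "$i $name" ;;
    esac
    i=$((i + 1))
done

# The same field list would follow "--outfmt 6" in a DIAMOND call.
# The command is only printed here; file names are placeholders.
echo "diamond blastx -d reference.dmnd -q scaffolds.fasta -o out.m8 --outfmt 6 $fields"
```

With this layout, index 1 is `sseqid`, index 11 is `bitscore` and index 17 is `qseq`. If a different field order is used, the third sortdiamond argument has to be adjusted to match.
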
Diamond's default tabular output (`--outfmt 6`) does not contain qseq, so you must customize the field list given after `--outfmt 6`.

### delstop

`delstop --delete`

Deletes stop codons generated by macse. `fasta_aa` and `fasta_nt` should be macse output files; `--delete` should be used when the downstream software is trimal.

### splitfasta

Usage: `splitfasta sample.fasta`

It always creates directories in the path where you run splitfasta and puts the split fasta files into those directories.