# RGB EPP

Reference Genome based Exon Phylogeny Pipeline

License: GPL-2.0-only

Author: Guoyi Zhang

## Requirements

### External software 

- fastp
- spades.py (provided by spades)
- diamond
- bowtie2
- samtools
- bcftools
- exonerate (optional, only for --codon)
- java
- macse (default recognized path: /usr/share/java/macse.jar)
- trimal

### Internal software

- sortdiamond (default recognized path: /usr/bin/sortdiamond)
- delstop (default recognized path: /usr/bin/delstop)
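
Before a first run it can help to confirm that the tools listed above are reachable. The following is a minimal sketch, not part of RGBEPP: `missing_tools` is a hypothetical helper, and it assumes the tools are expected on `PATH` under the names given in the lists (with macse checked as a jar file at its default path).

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}

# Tool names are taken from the lists above. exonerate is left out
# because it is only needed with --codon. macse is a jar file, so it is
# checked as a file at its default path rather than as a command.
missing=$(missing_tools fastp spades.py diamond bowtie2 samtools bcftools java trimal sortdiamond delstop)
[ -f /usr/share/java/macse.jar ] || missing="$missing macse.jar"

if [ -n "$missing" ]; then
    echo "Not found:" $missing
else
    echo "All required tools found"
fi
```

Tools installed elsewhere can still be used by passing their paths through the corresponding options (`--fastp`, `--macse`, and so on) or the `-c` config file.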

## Arguments

### Details

```
	    -c	--config	config file for software path (optional)
	    -g	--genes		gene file path (optional, if -r is specified)
	    -f	--functions	functions type (optional): all clean assembly 
	      	           	 map postmap varcall consen codon align trim
	    -h	--help		show this information
	    -l	--list		list file path
	    -m	--memory	memory settings (optional, default 16 GB)
	    -r	--reference	reference genome path
	    -t	--threads	threads setting (optional, default 8 threads)
	    --codon		Only use the codon region (optional)
	    --fastp		Fastp path (optional)
	    --spades		Spades python path (optional)
	    --diamond		Diamond python path (optional)
	    --sortdiamond	SortDiamond python path (optional)
	    --bowtie2		Bowtie2 path (optional)
	    --samtools		Samtools path (optional)
	    --bcftools		Bcftools path (optional)
	    --exonerate		Exonerate path (optional)
	    --macse		Macse jarfile path (optional)
	    --delstop		Delstop path (optional)
	    --trimal		Trimal path (optional)
	    for example: ./RGBEPP -f all -l list -t 8 -r reference.fasta 
```

### Directories Design

```
.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_bowtie2
├── 04_bam
├── 05_vcf
├── 06_consen
├── 07_macse
├── 08_macse
├── 08_trimal
├── list
├── gene
├── reference.aa.fasta
└── RGBEPP
```

Each numbered directory corresponds to one function.

`00_raw` should contain all raw fastq.gz data.

### Text Files

`list` is a text file containing all sample names, one per line. If your raw data files are named `${list_name}_R1.fastq.gz` and `${list_name}_R2.fastq.gz`, then `${list_name}` is what you should put in the `list` file. On a Linux/Unix system, an easy way to generate it is:

```
cd 00_raw
ls | sed "s@_R[12]\.fastq\.gz@@g" | sort -u > ../list
cd ..
```
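
Once `list` exists, a quick sanity check is to confirm that every sample has both read files. The sketch below is an illustration only: `check_pairs` is a hypothetical helper, not part of RGBEPP, and it assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming described above.

```shell
# Report sample names from LIST_FILE that lack an R1 or R2 file in RAW_DIR.
check_pairs() {
    list_file=$1
    raw_dir=$2
    while IFS= read -r name; do
        # Skip empty lines in the list file.
        [ -n "$name" ] || continue
        for mate in R1 R2; do
            [ -f "$raw_dir/${name}_${mate}.fastq.gz" ] || echo "$name: missing ${name}_${mate}.fastq.gz"
        done
    done < "$list_file"
    return 0
}

# Usage, from the project root:
# check_pairs list 00_raw
```

No output means every listed sample has both files.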

`genes` is a text file containing all gene names from the reference fasta file. On a Linux/Unix system, an easy way to generate it is:

```
grep '>' Reference.fasta | sed "s@>@@g" > genes
```

`reference.aa.fasta` can have any other name, but it must contain the reference amino acid sequences in fasta format.

## Process

### RGBEPP functions

 - Function clean: quality control and trimming (fastp)
 - Function assembly: de novo assembly (spades)
 - Function map: local nucleic acid alignment search against amino acid subject sequences (diamond, sortdiamond), then mapping of raw reads to their scaffold sequences (bowtie2)
 - Function postmap: sorting and marking the read alignments (samtools)
 - Function varcall: variant calling and filtering (bcftools)
 - Function consen: build consensus fasta files from the vcf files (bcftools), then sort sequences by gene name and taxon name (RGBEPP)
 - Function codon (optional): extract only the exon sequences (exonerate)
 - Function align: codon-based multiple sequence alignment (macse)
 - Function trim: codon-based trimming (trimal, delstop)

### Argument requirements for functions

| Functions | -g/--genes | -l/--list | -r/--reference |
| --------- | ---------- | --------- | -------------- |
| clean | | ✔ | |
| assembly | | ✔ | |
| map | | ✔ | ✔ |
| postmap | | ✔ | |
| varcall | | ✔ | |
| consen | ✔ | ✔ | |
| codon | ✔ | | ✔ |
| align | ✔ | | |
| trim | ✔ | | |


### Downstream process

 - concatenate sequences via SeqCombGo or catsequences or sequencematrix
 - coalescent / concatenated phylogeny

## Internal software

### sortdiamond

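Usage: `sortdiamond diamond_output.m8 generated.fasta [columns] [bitscore|evalue]`

- `columns` (optional): comma-separated column indices, counted from 0, for sseq, qstart, qend, bitscore/evalue and qseq, in that order. Default: `1,6,7,11,17`.
- `bitscore|evalue` (optional): the score type to use. Default: `bitscore`.

With the defaults, sseq is the 2nd column, qstart is the 7th column, and so on.

Diamond's default tabular output (`--outfmt 6`) does not contain qseq, so you must customize the field list given after `--outfmt 6` to include it.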
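
When a custom field list is used, the column indices passed to sortdiamond have to match it. The sketch below derives them from the field list instead of counting by hand. It is an illustration under assumptions: `col_index` is a hypothetical helper, the field list is only one possible choice, "sseq" is taken to mean diamond's `sseqid` field (the 2nd default column), and the file and database names in the commented commands are placeholders.

```shell
# Field list to pass to diamond after "--outfmt 6" (an example choice).
fields="qseqid sseqid qstart qend bitscore qseq"

# Print the 0-based position of field $1 within $fields.
col_index() {
    i=0
    for f in $fields; do
        if [ "$f" = "$1" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Order expected by sortdiamond: sseq,qstart,qend,bitscore/evalue,qseq
cols="$(col_index sseqid),$(col_index qstart),$(col_index qend),$(col_index bitscore),$(col_index qseq)"
echo "$cols"   # prints 1,2,3,4,5

# The two values would then be used together, for example:
# diamond blastx --db reference --query scaffolds.fasta --out sample.m8 --outfmt 6 $fields
# sortdiamond sample.m8 sample.fasta "$cols"
```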

### splitfasta

Usage: splitfasta sample.fasta

It always creates its output directories in the path from which you run splitfasta and puts the split fasta files into them.