# RGB EPP
Reference Genome based Exon Phylogeny Pipeline

License: GPL-2.0-only

Author: Guoyi Zhang

## Requirements
### External software
- GNU Bash (provides cd)
- GNU coreutils (provides cp, mkdir, mv)
- GNU findutils (provides find)
- fastp
- spades.py (provided by spades)
- diamond
- java
- macse (default recognized path: /usr/share/java/macse.jar)
- GNU parallel
### Internal software
- splitfasta (default recognized path: /usr/bin/splitfasta)
- sortdiamond (default recognized path: /usr/bin/sortdiamond)
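
Before a first run it can help to confirm that everything listed above is installed. The following is a minimal sketch, not part of RGBEPP.sh: the function name `check_rgbepp_deps` and the environment variables `MACSE_JAR`, `SPLITFASTA` and `SORTDIAMOND` are hypothetical names introduced here, and their fallbacks are simply the default paths given in this README.

```shell
# Report which of the tools named in this README can be found.
# MACSE_JAR, SPLITFASTA and SORTDIAMOND are hypothetical overrides (assumption);
# the fallbacks are the default paths listed in the README.
check_rgbepp_deps() {
    missing=0
    for tool in fastp spades.py diamond java parallel; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing command: $tool"
            missing=$((missing + 1))
        fi
    done
    for path in "${MACSE_JAR:-/usr/share/java/macse.jar}" \
                "${SPLITFASTA:-/usr/bin/splitfasta}" \
                "${SORTDIAMOND:-/usr/bin/sortdiamond}"; do
        if [ ! -e "$path" ]; then
            echo "missing file: $path"
            missing=$((missing + 1))
        fi
    done
    echo "$missing item(s) missing"
    [ "$missing" -eq 0 ]
}
```

Calling `check_rgbepp_deps` prints one line per missing item and returns a non-zero status if anything is absent. If macse, splitfasta or sortdiamond live elsewhere, pass their locations to RGBEPP.sh with `--macse`, `--splitfasta` and `--sortdiamond`.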
## Arguments
### Details
```
-c  --contigs      contigs type: scaffolds or contigs
-g  --genes        gene file path
-f  --functions    functions type (optional): all clean
                   assembly fasta map pre split merge align
-h  --help         show this information
-l  --list         list file path
-m  --memory       memory settings (optional, default 16 GB)
-r  --reference    reference genome path
-t  --threads      threads setting (optional, default 8 threads)
    --macse        Macse jarfile path
    --sortdiamond  sortdiamond file path
    --splitfasta   splitfasta file path

for example: bash RGBEPP.sh -c scaffolds -f all -l list -g genes -r reference.aa.fasta
```
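
Because `-f` accepts individual function names as well as `all`, the pipeline can also be driven one function at a time, which makes it easier to rerun a single failed step. The sketch below assumes that each function can be called on its own once the previous step's output exists and that passing the full set of options to every call is harmless; neither is confirmed by this README. `run_stages` and the `RUN` variable are hypothetical names. By default it only prints the commands.

```shell
# Print (or, with RUN=1, execute) one RGBEPP.sh call per function, in pipeline order.
# RUN and the fixed file names are assumptions made for this illustration.
run_stages() {
    for fn in clean assembly fasta map pre split merge align; do
        set -- bash RGBEPP.sh -c scaffolds -f "$fn" -l list -g genes -r reference.aa.fasta
        if [ "${RUN:-0}" = "1" ]; then
            "$@" || { echo "function $fn failed" >&2; return 1; }
        else
            echo "$*"
        fi
    done
}
```

Run `run_stages` to preview the eight commands, or `RUN=1 run_stages` to execute them and stop at the first failure.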
### Directories Design
```
.
├── 00_raw
├── 01_fastp
├── 02_spades
├── 03_assemblied
├── 04_diamond
├── 05_pre
├── 06_split
├── 07_merge
├── 08_macse
├── genes
├── list
├── reference.aa.fasta
└── RGBEPP.sh
```
Each numbered directory corresponds to one function.

`00_raw` should contain all raw fastq.gz data.
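
To prepare a fresh project, the skeleton can be created in one go. This is a small sketch under the assumption that creating the directories in advance does no harm; RGBEPP.sh may well create them itself. The function name `init_layout` is hypothetical.

```shell
# Create the directory skeleton of the layout under a project root
# (default: the current directory).
init_layout() {
    root=${1:-.}
    for d in 00_raw 01_fastp 02_spades 03_assemblied 04_diamond \
             05_pre 06_split 07_merge 08_macse; do
        mkdir -p "$root/$d" || return 1
    done
    # Remind the user if no raw reads are in place yet.
    if ! ls "$root"/00_raw/*.fastq.gz >/dev/null 2>&1; then
        echo "note: no fastq.gz files found in $root/00_raw" >&2
    fi
}
```

For example, `init_layout my_project` creates the nine directories under `my_project`, after which the raw reads can be copied into `my_project/00_raw`.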
### Text Files
`list` is a text file naming every sample, one per line. If your raw data follow the pattern `${list_name}_R1.fastq.gz` and `${list_name}_R2.fastq.gz`, then `${list_name}` is what you should put in `list`. On a Linux/Unix system, an easy way to generate it is:
```
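cd 00_raw
# Strip the read suffix; sort -u keeps each sample once, since R1 and R2 give the same name
ls | sed "s@_R[12]\.fastq\.gz@@g" | sort -u > ../list
cd ..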
```
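
A quick way to catch naming problems early is to confirm that each name in `list` really has both read files. The following sketch assumes the `_R1`/`_R2` naming pattern described above; `check_pairs` is a hypothetical helper, not part of RGBEPP.sh.

```shell
# For every sample name in a list file, confirm both read files exist in the raw directory.
# Usage: check_pairs [list_file] [raw_dir]
check_pairs() {
    list_file=${1:-list}
    raw_dir=${2:-00_raw}
    status=0
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue          # skip blank lines
        for r in R1 R2; do
            if [ ! -f "$raw_dir/${name}_${r}.fastq.gz" ]; then
                echo "missing: $raw_dir/${name}_${r}.fastq.gz"
                status=1
            fi
        done
    done < "$list_file"
    return "$status"
}
```

Run from the project root, `check_pairs` prints every missing file and returns a non-zero status if any sample is incomplete.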
`genes` is a text file containing all gene names from the reference fasta file. On a Linux/Unix system, an easy way to generate it is:
```
grep '>' reference.aa.fasta | sed "s@>@@g" > genes
```
`reference.aa.fasta` can have any other name, but it must contain the amino acid sequences of the reference genome in fasta format.
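
Since `genes` is derived from the reference, the two files can be cross-checked before a long run. This is only a sanity-check sketch: `check_genes` is a hypothetical helper, and the idea that duplicated reference names would cause trouble downstream is an assumption rather than something this README states.

```shell
# Compare the names in a genes file with the headers of the reference fasta.
# Reports names that occur more than once in the reference, then checks the counts agree.
check_genes() {
    ref=${1:-reference.aa.fasta}
    genes=${2:-genes}
    dups=$(grep '>' "$ref" | sed 's@>@@g' | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicated reference names:"
        echo "$dups"
        return 1
    fi
    n_ref=$(grep -c '>' "$ref" || true)      # number of fasta headers
    n_genes=$(grep -c . "$genes" || true)    # number of non-empty lines
    echo "reference: $n_ref sequences, genes: $n_genes names"
    [ "$n_ref" -eq "$n_genes" ]
}
```

`check_genes reference.aa.fasta genes` returns a non-zero status when names are duplicated or the counts differ.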
## Process
### RGBEPP.sh functions
- Function clean: quality control and trimming (fastp)
- Function assembly: de novo assembly (spades)
- Function fasta: gather all fasta files from the assembly directories (RGBEPP.sh)
- Function map: local alignment search of nucleic acid query sequences against amino acid subject sequences (diamond)
- Function pre: generate the corresponding sequences from the blast-style output (sortdiamond)
- Function split: split fasta sequences into directories based on the reference genome (splitfasta)
- Function merge: merge the sequences of different taxa for the same reference exon into one fasta file (RGBEPP.sh)
- Function align: codon-based multiple sequence alignment (macse)
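
To make the `fasta` step and the `-c` option more concrete, here is a rough sketch of what gathering assemblies could look like. It is not the code in RGBEPP.sh. SPAdes does write `scaffolds.fasta` and `contigs.fasta` into its output directory, but the per-sample layout `02_spades/<sample>/` and the output name `<sample>.fasta` are assumptions, and `gather_fasta` is a hypothetical name.

```shell
# Copy each sample's SPAdes assembly into one directory, named after the sample.
# Usage: gather_fasta scaffolds|contigs [list_file] [spades_dir] [out_dir]
gather_fasta() {
    kind=$1
    list_file=${2:-list}
    spades_dir=${3:-02_spades}
    out_dir=${4:-03_assemblied}
    mkdir -p "$out_dir" || return 1
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue                 # skip blank lines
        src="$spades_dir/$name/$kind.fasta"        # assumed per-sample SPAdes directory
        if [ -f "$src" ]; then
            cp "$src" "$out_dir/$name.fasta"
        else
            echo "no $kind.fasta for $name" >&2
        fi
    done < "$list_file"
}
```

With this layout, `gather_fasta scaffolds` would mirror the choice made by `-c scaffolds`, and `gather_fasta contigs` the choice made by `-c contigs`.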
### Downstream process
- Concatenate sequences with SeqCombGo, catsequences, or SequenceMatrix
- Infer a coalescent or concatenated phylogeny
# sortdiamond
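Usage: `sortdiamond diamond_output.m8 generated.fasta [sseq,qstart,qend,bitscore/evalue,qseq] [bitscore/evalue]`

- The third argument (optional) gives the column indices of sseq, qstart, qend, bitscore/evalue, and qseq, counted from 0. The default is `1,6,7,11,17`, so by default sseq is column 2, qstart is column 7, qend is column 8, and so on.
- The fourth argument (optional) is `bitscore` or `evalue`; the default is `bitscore`.

Diamond's default tabular output (`--outfmt 6`) does not contain qseq, so you must customize the fields of output format 6 to include it.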
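
The 0-based indices are easy to misread, so here is a small sketch that shows which columns the default `1,6,7,11,17` selects. It uses awk, which counts fields from 1, so every index is shifted by one. `show_default_fields` is a hypothetical helper for inspecting a table, not a replacement for sortdiamond. Which extra fields occupy indices 12 to 16 is not stated in this README; the only requirement illustrated is that qseq ends up at index 17.

```shell
# Print the five columns sortdiamond reads by default (0-based 1,6,7,11,17)
# from a tab-separated diamond table. awk fields are 1-based, hence the +1 shift.
show_default_fields() {
    awk -F '\t' -v OFS='\t' '{ print $2, $7, $8, $12, $18 }' "$@"
}
```

Running `show_default_fields diamond_output.m8 | head` on your own table is a quick check that the last printed column really contains sequences before you pass the file to sortdiamond.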
# splitfasta
Usage: splitfasta sample.fasta
It always creates its output directories in the directory from which splitfasta is run, and writes the split fasta files into them.
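
Because the output location depends on where the command is started, one option when calling splitfasta by hand is to run it from inside the target directory in a subshell. The helper below is a generic sketch; `run_in_dir` is a hypothetical name and the paths in the usage note are assumptions.

```shell
# Run a command from inside a chosen directory without changing the caller's directory.
# Usage: run_in_dir DIR COMMAND [ARGS...]
run_in_dir() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    ( cd "$dir" && "$@" )    # the subshell keeps the cd local
}
```

For example, `run_in_dir 06_split splitfasta "$PWD/05_pre/sample.fasta"` would place the output under `06_split`. The input is given as an absolute path because the command runs from a different directory.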