From 32c423a2831aa7958217a1fbc627530fe3c722fc Mon Sep 17 00:00:00 2001
From: Guoyi Zhang
Date: Fri, 5 Jul 2024 15:14:38 +1000
Subject: [PATCH] add: readme

---
 README.md | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..b80823b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,104 @@
+# RGB EPP
+
+Reference Genome based Exon Phylogeny Pipeline
+
+License: GPL-2.0-only
+Author: Guoyi Zhang
+
+# Requirements
+
+## External software
+
+- GNU Bash (provides cd)
+- GNU coreutils (provides cp, mkdir, mv)
+- GNU findutils (provides find)
+- fastp
+- spades.py (provided by spades)
+- diamond
+- java
+- macse (default recognized path: /usr/share/java/macse.jar)
+- GNU parallel
+
+## Internal software
+
+- splitfasta (default recognized path: /usr/bin/splitfasta)
+- sortdiamond (default recognized path: /usr/bin/sortdiamond)
+
+# Arguments
+
+## Details
+
+```
+-c  --contigs      contigs type: scaffolds or contigs
+-g  --genes        gene file path
+-f  --functions    functions type (optional): all clean
+                   assembly fasta map pre split merge align
+-h  --help         show this information
+-l  --list         list file path
+-m  --memory       memory settings (optional, default 16 GB)
+-r  --reference    reference genome path
+-t  --threads      threads setting (optional, default 8 threads)
+    --macse        Macse jarfile path
+    --sortdiamond  sortdiamond file path
+    --splitfasta   splitfasta file path
+for example: bash RGBEPP.sh -c scaffolds -f all -l list -g genes -r reference.aa.fasta
+```
+
+## Directory Design
+
+```
+.
+├── 00_raw
+├── 01_fastp
+├── 02_spades
+├── 03_assemblied
+├── 04_diamond
+├── 05_pre
+├── 06_split
+├── 07_merge
+├── 08_macse
+├── genes
+├── list
+├── reference.aa.fasta
+└── RGBEPP.sh
+```
+
+Each numbered directory corresponds to a pipeline function.
+
+## Text Files
+
+`list` is a text file containing all sample names. If your raw data files are named ${list_name}\_R1.fastq.gz and ${list_name}\_R2.fastq.gz, ${list_name} is what you should put in the `list` file. An easy way to generate it on a Linux/Unix system is the following command:
+
+```
+cd 00_raw
+ls | sed "s@_R[12]\.fastq\.gz@@g" | sort -u > ../list
+cd ..
+```
+
+`genes` is a text file containing all gene names from the reference fasta file. An easy way to generate it on a Linux/Unix system is the following command:
+
+```
+grep '^>' reference.aa.fasta | sed "s@^>@@" > genes
+```
+
+`reference.aa.fasta` can have any other name, but it must contain the reference amino acid sequences in fasta format.
+
+# Progress
+
+## RGBEPP.sh functions
+
+ - Function clean: quality control and trimming (fastp)
+ - Function assembly: de novo assembly (spades)
+ - Function fasta: gather all fasta files from the assembly directories (RGBEPP.sh)
+ - Function map: local alignment search of nucleotide sequences against amino acid subject sequences (diamond)
+ - Function pre: generate the corresponding sequences from the BLAST-style output (sortdiamond)
+ - Function split: split fasta sequences into directories based on the reference genome (splitfasta)
+ - Function merge: merge the sequences of different taxa for the same reference exon into one fasta file (RGBEPP.sh)
+ - Function align: codon-based multiple sequence alignment (macse)
+
+## Downstream process
+
+ - concatenate sequences via SeqCombGo, catsequences, or sequencematrix
+ - coalescent / concatenated phylogeny
+
+