add: readme

2024-07-05 15:14:38 +10:00 · 2024-07-05 15:14:38 +10:00 · 32c423a283
commit 32c423a283
parent 84575da5b8
1 changed files with 104 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,104 @@
+# RGB EPP
+
+Reference Genome based Exon Phylogeny Pipeline
+
+License: GPL-2.0-only
+Author: Guoyi Zhang
+
+# Requirements
+
+## External software 
+
+- GNU Bash (provides cd)
+- GNU coreutils (provides cp, mv, mkdir)
+- GNU findutils (provides find)
+- fastp
+- spades.py (provided by spades)
+- diamond
+- java
+- macse (default recognized path: /usr/share/java/macse.jar)
+- GNU parallel
+
+## Internal software
+
+- splitfasta (default recognized path: /usr/bin/splitfasta)
+- sortdiamond (default recognized path: /usr/bin/sortdiamond)
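+
+Before the first run it can help to confirm that everything listed above is installed. The following is a minimal sketch, not part of RGBEPP.sh: `missing_requirements` is a hypothetical helper name, and the tool names and default paths are the ones listed above.
+
+```shell
+# Hypothetical helper (not part of RGBEPP.sh): print every requirement
+# that cannot be found. Arguments containing a slash are treated as file
+# paths; all other arguments are looked up on PATH.
+missing_requirements() {
+    local item
+    for item in "$@"; do
+        case "$item" in
+            */*) [ -e "$item" ] || printf '%s\n' "$item" ;;
+            *)   command -v "$item" >/dev/null 2>&1 || printf '%s\n' "$item" ;;
+        esac
+    done
+}
+
+# Names and default paths as listed in the Requirements section.
+missing_requirements fastp spades.py diamond java parallel \
+    /usr/share/java/macse.jar /usr/bin/splitfasta /usr/bin/sortdiamond
+```
+
+No output means every requirement was found. If a tool lives elsewhere, pass its location with `--macse`, `--sortdiamond` or `--splitfasta`.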
+
+# Arguments
+
+## Details
+
+```
+-c	--contigs	contigs type: scaffolds or contigs
+-g	--genes		gene file path
+-f	--functions	functions type (optional): all clean 
+	  		assembly fasta map pre split merge align
+-h	--help		show this information
+-l	--list		list file path
+-m	--memory	memory settings (optional, default 16 GB)
+-r	--reference	reference genome path
+-t	--threads	threads setting (optional, default 8 threads)
+	--macse		macse jar file path
+	--sortdiamond	sortdiamond file path
+	--splitfasta	splitfasta file path
+for example: bash RGBEPP.sh -c scaffolds -f all -l list -g genes -r reference.aa.fasta 
+```
+
+## Directory Layout
+
+```
+.
+├── 00_raw
+├── 01_fastp
+├── 02_spades
+├── 03_assemblied
+├── 04_diamond
+├── 05_pre
+├── 06_split
+├── 07_merge
+├── 08_macse
+├── genes
+├── list
+├── reference.aa.fasta
+└── RGBEPP.sh
+```
+
+Each numbered directory from `01_fastp` to `08_macse` corresponds to one pipeline function, in order; `00_raw` holds the raw reads.
+
+## Text Files
+
+`list` is a text file naming every sample, one per line. If your raw data files are named `${list_name}_R1.fastq.gz` and `${list_name}_R2.fastq.gz`, then `${list_name}` is the entry to put in `list`. On a Linux/Unix system, an easy way to generate it is:
+
+```
+cd 00_raw
+ls | sed 's@_R[12]\.fastq\.gz$@@' | sort -u > ../list
+cd ..
+```
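+
+Because each sample needs both an R1 and an R2 file, it may be worth checking `list` against `00_raw` before starting. The following is a minimal sketch under that naming assumption; `missing_pairs` is a hypothetical helper name, not part of RGBEPP.sh.
+
+```shell
+# Hypothetical helper (not part of RGBEPP.sh): print each sample from the
+# list file that lacks an R1 or R2 file in the raw-data directory.
+missing_pairs() {
+    local list_file=$1 raw_dir=$2 sample
+    while IFS= read -r sample; do
+        [ -n "$sample" ] || continue    # skip blank lines
+        if [ ! -f "$raw_dir/${sample}_R1.fastq.gz" ] ||
+           [ ! -f "$raw_dir/${sample}_R2.fastq.gz" ]; then
+            printf '%s\n' "$sample"
+        fi
+    done < "$list_file"
+}
+```
+
+Run `missing_pairs list 00_raw` from the project root; no output means every listed sample has both read files.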
+
+`genes` is a text file containing every gene name from the reference FASTA file. On a Linux/Unix system, an easy way to generate it is:
+
+```
+grep '^>' reference.aa.fasta | sed 's@^>@@' > genes
+```
+
+`reference.aa.fasta` can have any other name, but it must contain the reference genome's amino acid sequences in FASTA format.
+
+# Process
+
+## RGBEPP.sh functions
+
+ - Function clean: Quality control + trimming (fastp)
+ - Function assembly: de novo assembly (spades)
+ - Function fasta: gather all fasta files from assembly directories (RGBEPP.sh)
+ - Function map: local alignment search of nucleotide sequences against amino acid subject sequences (diamond)
+ - Function pre: generate corresponding sequences based on BLAST-style output (sortdiamond)
+ - Function split: split FASTA sequences into directories based on the reference genome (splitfasta)
+ - Function merge: merge sequences from different taxa for the same reference exon into one FASTA file (RGBEPP.sh)
+ - Function align: codon-based multiple sequence alignment (macse)
+
+## Downstream process
+
+ - concatenate sequences via SeqCombGo or catsequences or sequencematrix
+ - coalescent / concatenated phylogeny
+
+