Prepare Custom Database
- Retrieve the recommended PhyloFisher directory structure, OrthoMCL database, example `metadata.tsv`, and `tree_colors.tsv` file via `wget`:
  `wget https://ndownloader.figshare.com/files/29093325`
- Uncompress the file `29093325`:
  `tar -xzvf 29093325`
- Move into the directory `PhyloFisher_FOR_CUSTOM_DATABASE`:
  `cd PhyloFisher_FOR_CUSTOM_DATABASE`
- Place your single gene files of orthologs in the directory `/database/orthologs/`
  - The ortholog files must be in FASTA format.
  - Each ortholog file must be named with the following convention: `{gene_name}.fas`.
    - Ex. `RPL7.fas`
  - Each individual taxon should have a Unique ID as the header in all ortholog files. This Unique ID must be the same in all ortholog files.
  - Each taxon can be present only once in each ortholog file.
- Place files of known paralogs for each gene in the directory `/database/paralogs/` (OPTIONAL)
  - Each gene file must be named with the following convention: `{gene_name}_paralogs.fas`.
    - Ex. `RPL7_paralogs.fas`
  - Each individual taxon should have a Unique ID as the header in all paralog files. This Unique ID must be the same in all paralog files and the corresponding ortholog files.
  - Each taxon can be present more than once in each paralog file.
- Place the complete proteome of each taxon present in the ortholog files in `/database/proteomes`
  - All proteomes must be in FASTA format.
  - All proteomes must be tarred and gzipped and follow the naming convention `{Unique ID}.tar.gz`.
- Fill out the `metadata.tsv` file found in `PhyloFisher_FOR_CUSTOM_DATABASE/database`. Detailed instructions on preparing the `metadata.tsv` file can be found here.
- Run `build_database.py`. Detailed instructions on `build_database.py` can be found here.
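
The layout rules in the steps above can be checked before running `build_database.py`. The following is a minimal sketch and not part of PhyloFisher: the helper names `read_headers` and `check_database` are hypothetical, and it assumes that each FASTA header line holds only the Unique ID and that the function is pointed at the `database` directory.

```python
from collections import Counter
from pathlib import Path


def read_headers(fasta_path):
    """Return the header of every sequence in a FASTA file, without the '>'."""
    with open(fasta_path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def check_database(database_dir):
    """Collect layout problems in a custom database directory (hypothetical helper)."""
    database = Path(database_dir)
    problems = []
    taxa = set()

    for path in sorted((database / "orthologs").iterdir()):
        # Ortholog files must follow the {gene_name}.fas convention.
        if path.suffix != ".fas":
            problems.append(f"{path.name}: ortholog files must be named {{gene_name}}.fas")
            continue
        counts = Counter(read_headers(path))
        taxa.update(counts)
        # Each taxon may be present only once in an ortholog file.
        for unique_id, n in sorted(counts.items()):
            if n > 1:
                problems.append(f"{path.name}: {unique_id} appears {n} times")

    # Every taxon found in the ortholog files needs a {Unique ID}.tar.gz proteome.
    for unique_id in sorted(taxa):
        if not (database / "proteomes" / f"{unique_id}.tar.gz").is_file():
            problems.append(f"proteomes/{unique_id}.tar.gz is missing")

    return problems
```

Calling `check_database("PhyloFisher_FOR_CUSTOM_DATABASE/database")` returns a list of messages; an empty list means none of these three kinds of problem were found. It does not inspect the contents of the proteome archives or the `metadata.tsv` file.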
Some notes about sequence headers:
- Each sequence header (sequence header = Unique ID) within a file must be unique. Sequence headers must be the same across all files for a taxon and must be the same as the Unique ID provided in the `metadata.tsv` file for the taxon. This is true for both ortholog and paralog FASTA files.
- Sequence headers cannot contain underscores “_”, at symbols “@”, white spaces, or double dots “..”. This is true for both ortholog and paralog FASTA files.
- If you provided separate sequence files containing paralogs, each sequence header within each file will have “..p
” appended to the end by `build_database.py`.
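
The character restrictions in the notes above lend themselves to a quick automated check. A minimal sketch, assuming headers are checked before `build_database.py` appends any paralog suffix; the names `FORBIDDEN` and `invalid_header_reason` are hypothetical and not part of PhyloFisher.

```python
import re

# Characters and substrings that the notes above rule out in a Unique ID.
FORBIDDEN = {
    "_": "underscore",
    "@": "at symbol",
    "..": "double dot",
}


def invalid_header_reason(header):
    """Return why a header is not allowed, or None if it passes these checks."""
    # Any white-space character (space, tab, newline) is disallowed.
    if re.search(r"\s", header):
        return "white space"
    for token, name in FORBIDDEN.items():
        if token in header:
            return name
    return None
```

For example, `invalid_header_reason("Homo_sapiens")` returns `"underscore"`, while `invalid_header_reason("Homosapi")` returns `None`. A single dot is accepted, since only double dots are listed as forbidden.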