Filter, align, and trim gene files followed by gene tree construction
Construct gene trees:
sgt_constructor.py [OPTIONS] -i <input_directory>

Required arguments:
-i, --input <input_dir>    Path to output of working_dataset_constructor.py

Optional arguments:
-t, --threads <N>          Number of threads
                           - Default: 1
--no_trees                 Do not build single gene trees
--trees_only <in_dir>      Only build single gene trees. No other operations performed
-o, --output <out_dir>     Path to user-defined output directory
                           - Default: ./sgt_constructor_out_<M.D.Y>
-h, --help                 Show this help message and exit
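
For scripted runs, the invocation can be assembled programmatically. The following is a minimal sketch, assuming sgt_constructor.py is on the PATH. The helper name build_sgt_command and the input path "working_dataset_out" are hypothetical placeholders; only flags documented in this section are used.

```python
import shlex


def build_sgt_command(input_dir, threads=1, output_dir=None, no_trees=False):
    """Assemble an sgt_constructor.py command line from the documented flags.

    build_sgt_command is a hypothetical helper, not part of PhyloFisher.
    """
    cmd = ["sgt_constructor.py", "-i", str(input_dir), "-t", str(threads)]
    if output_dir is not None:
        # Without -o, the script writes to ./sgt_constructor_out_<M.D.Y>.
        cmd += ["-o", str(output_dir)]
    if no_trees:
        cmd.append("--no_trees")
    return cmd


if __name__ == "__main__":
    # "working_dataset_out" stands in for the output directory of
    # working_dataset_constructor.py.
    cmd = build_sgt_command("working_dataset_out", threads=8)
    print(shlex.join(cmd))
    # To launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list, rather than a single string, avoids quoting problems when directory names contain spaces.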
Default sgt_constructor.py output:
- a directory sgt_constructor_out_<M.D.Y> that contains:
  - a directory prequal that contains the files:
    - {gene_name}.aa - unaligned gene file used as PREQUAL input
    - {gene_name}.aa.filtered - PREQUAL filtered sequence file used as input for MAFFT in the subsequent length filtration step
    - {gene_name}.aa.filtered.PP - output of PREQUAL ##
    - {gene_name}.aa.warning - output of PREQUAL ##
  - a directory length_filtration that contains:
    - a directory mafft that contains the file:
      - {gene_name}.aln - aligned gene file in fasta format used as input for Divvier
    - a directory divvier that contains the files:
      - {gene_name}.aln.divvy.fas - Divvier filtered sequence file $$
      - {gene_name}.aln.PP - output of Divvier $$
    - a directory BMGE that contains the files:
      - {gene_name}.pre_bmge - modified version of “{gene_name}.aln.divvy.fas” with the character “X” replaced by the character “-”, used as input for BMGE
      - {gene_name}.bmge - output of BMGE and input for the length filtration step
      - {gene_name}.length_filtered - length filtered fasta file used as input for a second run of MAFFT
  - a directory mafft that contains the file:
    - {gene_name}.aln2 - output of the second run of MAFFT and input for a second run of Divvier
  - a directory divvier that contains the files:
    - {gene_name}.aln2.divvy.fas - output of the second run of Divvier and input for trimAl
    - {gene_name}.aln2.PP - output of Divvier $$
  - a directory trimAl that contains the files:
    - {gene_name}.final - trimmed alignment in fasta format and input for RAxML
    - {gene_name}.final.reduced - trimmed alignment in phylip format
  - a directory RAxML that contains the files:
    - {gene_name}.raxml.bestTreeCollapsed &&
    - {gene_name}.raxml.rba &&
    - {gene_name}.raxml.startTree &&
    - {gene_name}.raxml.bestTree &&
    - {gene_name}.raxml.mlTrees &&
    - {gene_name}.raxml.support &&
    - {gene_name}.raxml.bestModel &&
    - {gene_name}.raxml.bootstraps &&
    - {gene_name}.raxml.log &&
  - a directory logs that contains:
    - a directory prequal that contains the files:
      - {gene_name}.log - the log file for each gene run through PREQUAL
    - a directory length_filter_mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through the MAFFT step of length filtration
    - a directory length_filter_divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through the Divvier step of length filtration
    - a directory x_to_dash that contains the files:
      - {gene_name}.log - the log file for each gene run through the removal of X’s from gene files, as part of length filtration
    - a directory length_filter_bmge that contains the files:
      - {gene_name}.log - the log file for each gene run through the BMGE step of length filtration
    - a directory length_filtration that contains the files:
      - {gene_name}.log - the log file for the length filtration of each gene
    - a directory mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through MAFFT
    - a directory divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through Divvier
    - a directory trimal that contains the files:
      - {gene_name}.log - the log file for each gene run through trimAl
    - a directory raxml that contains the files:
      - {gene_name}.log - the log file for each gene run through RAxML
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
- a directory sgt_constructor_out_<M.D.Y>-local that contains:
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
  - metadata.tsv - the file that contains information about taxa already in the database
  - input_metadata.tsv - the input file that contains information about newly added taxa
  - tree_colors.tsv - the file used by forest.py in the next step to color taxa by taxonomic affiliation
- a file sgt_constructor_out_<M.D.Y>.tar.gz - a compressed file of the directory sgt_constructor_out_<M.D.Y>-local, created to ease the movement of all required data over to a local machine to render the svg files used in the next step. This is often necessary due to the lack of graphics capabilities of headless servers.
## - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.
$$ - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.
&& - These are standard RAxML-NG output files for each gene that PhyloFisher has appended the corresponding gene name to. See the RAxML-NG documentation for a thorough explanation of their contents.
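
Once sgt_constructor_out_<M.D.Y>.tar.gz has been copied to a local machine, it needs to be unpacked before the next step. The following is a minimal sketch using Python's standard tarfile module; the function name unpack_local_bundle is hypothetical and not part of PhyloFisher, and the archive is assumed to come from your own sgt_constructor.py run (a trusted source).

```python
import tarfile
from pathlib import Path


def unpack_local_bundle(archive_path, dest_dir="."):
    """Extract a .tar.gz bundle into dest_dir and return its top-level entry names.

    unpack_local_bundle is a hypothetical helper, not part of PhyloFisher.
    Only use it on archives you created yourself.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Collapse member paths to their first component, e.g. the "-local" directory.
    return sorted({Path(name).parts[0] for name in names if Path(name).parts})
```

Calling unpack_local_bundle with the path to the copied archive returns the names of the extracted top-level entries, which should include the directory ending in -local described above.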
NOTE: For a detailed explanation of the methodology implemented in sgt_constructor.py see “Automated Filtering, Alignment, Trimming, and Gene Tree Construction.”
NOTE: If sgt_constructor.py dies in the middle of a run, rerun the previous command with the -o flag added, pointing it at the sgt_constructor_out_<M.D.Y> output directory from the interrupted run. The script will pick up where it left off.
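
The restart described in this note amounts to repeating the earlier command with -o set to the existing output directory. A minimal sketch, assuming the earlier command is available as a list of arguments; resume_command is a hypothetical helper, and "working_dataset_out" and "sgt_constructor_out_<M.D.Y>" are placeholders for your own paths:

```python
def resume_command(previous_cmd, previous_out_dir):
    """Return the previous command with -o pointing at the interrupted run's output.

    resume_command is a hypothetical helper, not part of PhyloFisher.
    """
    cmd = list(previous_cmd)
    for flag in ("-o", "--output"):
        if flag in cmd and cmd.index(flag) + 1 < len(cmd):
            # An output flag is already present: point it at the previous run.
            cmd[cmd.index(flag) + 1] = str(previous_out_dir)
            return cmd
    # No output flag yet: append one.
    return cmd + ["-o", str(previous_out_dir)]


if __name__ == "__main__":
    previous = ["sgt_constructor.py", "-i", "working_dataset_out", "-t", "8"]
    # Replace the placeholder with the actual dated directory from the earlier run.
    print(" ".join(resume_command(previous, "sgt_constructor_out_<M.D.Y>")))
```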
NOTE: If sgt_constructor.py is submitted to a compute node without internet access, the creation of internal conda environments will fail. To circumvent this, run sgt_constructor.py from the command line on the head node until the conda environments are created. Then kill the process and submit sgt_constructor.py to a compute node.
NOTE: If sgt_constructor.py is bypassed in order to use alternative parameters for sequence filtering, alignment, and tree reconstruction, the following criteria must be met to re-enter the workflow and for downstream steps to perform correctly:
- Trees must have been built using a maximum likelihood program. Downstream quality control steps are not set up to interpret Bayesian posterior probability values.
- The directory structure of sgt_constructor_out_<M.D.Y>-local outlined above must be replicated. To do this:
  - Make a directory my_directory and, within my_directory, make a subdirectory called trees.
  - Tree files must follow the naming convention {nameofchoosing}.{gene_name}.tre and be located in the trees subdirectory.
  - Sequence files used to build the trees must follow the naming convention {gene_name}.final and be located in the trees subdirectory.
  - Sequence files used for length filtering must follow the naming convention {gene_name}.trimmed and be located in the trees subdirectory. THESE FILES ARE OPTIONAL.
  - The files metadata.tsv and tree_colors.tsv can be copied from PhyloFisherDatabase_v1.0/database/ and placed into my_directory along with the input_metadata.tsv file.
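
Before handing a hand-built my_directory to the downstream steps, it can help to check it against the criteria above. The following is a minimal sketch, not part of PhyloFisher: check_custom_layout and REQUIRED_TABLES are hypothetical names, the check assumes gene names contain no dots (so the gene name is the component directly before .tre), and it treats all three .tsv files as expected because they appear in the -local directory layout.

```python
from pathlib import Path

# Tables expected directly inside my_directory (assumption based on the -local layout).
REQUIRED_TABLES = ("metadata.tsv", "tree_colors.tsv", "input_metadata.tsv")


def check_custom_layout(my_directory):
    """Return a list of problems found in a hand-built my_directory; empty if none."""
    root = Path(my_directory)
    trees = root / "trees"
    problems = []

    for table in REQUIRED_TABLES:
        if not (root / table).is_file():
            problems.append(f"missing {table}")

    if not trees.is_dir():
        problems.append("missing trees subdirectory")
        return problems

    tree_files = sorted(trees.glob("*.tre"))
    if not tree_files:
        problems.append("no .tre files in trees")

    for tree_file in tree_files:
        parts = tree_file.name.split(".")
        # Expect {nameofchoosing}.{gene_name}.tre: at least three dot-separated parts.
        if len(parts) < 3:
            problems.append(
                f"{tree_file.name}: expected {{nameofchoosing}}.{{gene_name}}.tre"
            )
            continue
        gene_name = parts[-2]  # assumption: gene names contain no dots
        if not (trees / f"{gene_name}.final").is_file():
            problems.append(f"{tree_file.name}: missing {gene_name}.final")
        # {gene_name}.trimmed is optional, so its absence is not reported.

    return problems
```

An empty list means the directory matches the naming conventions listed above; it does not verify that the trees were built with a maximum likelihood program.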