Filter, align, and trim gene files followed by gene tree construction
Construct gene trees:
sgt_constructor.py [OPTIONS] -i <input_directory>
Required arguments:
-i, --input <input_dir>
    Path to output of working_dataset_constructor.py
Optional arguments:
-t, --threads <N>
    Number of threads - Default: 1
--no_trees
    Do not build single gene trees
--trees_only <in_dir>
    Only build single gene trees. No other operations performed
-o, --output <out_dir>
    Path to user-defined output directory - Default: ./sgt_constructor_out_<M.D.Y>
-h, --help
    Show this help message and exit
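As an illustration of how the options above combine, the following minimal Python sketch assembles an argument list from the documented flags. The helper name build_sgt_command and the example paths are hypothetical and not part of PhyloFisher; only flags listed above are used.

```python
import shlex
from typing import List, Optional


def build_sgt_command(input_dir: str,
                      threads: int = 1,
                      output_dir: Optional[str] = None,
                      no_trees: bool = False) -> List[str]:
    """Assemble an sgt_constructor.py argument list (hypothetical helper)."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    # -i is the only required argument; -t defaults to 1 in the help text.
    cmd = ["sgt_constructor.py", "-i", input_dir, "-t", str(threads)]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if no_trees:
        cmd.append("--no_trees")
    return cmd


if __name__ == "__main__":
    # "working_dataset_out" is a placeholder path, not a PhyloFisher default.
    print(shlex.join(build_sgt_command("working_dataset_out", threads=8)))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a job submission script.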
Default sgt_constructor.py output:
- a directory sgt_constructor_out_<M.D.Y> that contains:
  - a directory prequal that contains the files:
    - {gene_name}.aa - unaligned gene file used as PREQUAL input
    - {gene_name}.aa.filtered - PREQUAL filtered sequence file used as input for MAFFT in the subsequent length filtration step
    - {gene_name}.aa.filtered.PP - output of PREQUAL ##
    - {gene_name}.aa.warning - output of PREQUAL ##
  - a directory length_filtration that contains:
    - a directory mafft that contains the file:
      - {gene_name}.aln - aligned gene file in fasta format used as input for Divvier
    - a directory divvier that contains the files:
      - {gene_name}.aln.divvy.fas - Divvier filtered sequence file $$
      - {gene_name}.aln.PP - output of Divvier $$
    - a directory BMGE that contains the files:
      - {gene_name}.pre_bmge - modified version of {gene_name}.aln.divvy.fas with the character “X” replaced by the character “-”, used as input for BMGE
      - {gene_name}.bmge - output of BMGE and input for the length filtration step
      - {gene_name}.length_filtered - length filtered fasta file used as input for a second run of MAFFT
  - a directory mafft that contains the file:
    - {gene_name}.aln2 - output of the second run of MAFFT and input for a second run of Divvier
  - a directory divvier that contains the files:
    - {gene_name}.aln2.divvy.fas - output of the second run of Divvier and input for trimAl
    - {gene_name}.aln2.PP - output of Divvier $$
  - a directory trimAl that contains the files:
    - {gene_name}.final - trimmed alignment in fasta format and input for RAxML
    - {gene_name}.final.reduced - trimmed alignment in phylip format
  - a directory RAxML that contains the files:
    - {gene_name}.raxml.bestTreeCollapsed &&
    - {gene_name}.raxml.rba &&
    - {gene_name}.raxml.startTree &&
    - {gene_name}.raxml.bestTree &&
    - {gene_name}.raxml.mlTrees &&
    - {gene_name}.raxml.support &&
    - {gene_name}.raxml.bestModel &&
    - {gene_name}.raxml.bootstraps &&
    - {gene_name}.raxml.log &&
  - a directory logs that contains:
    - a directory prequal that contains the files:
      - {gene_name}.log - the log file for each gene run through PREQUAL
    - a directory length_filter_mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through the MAFFT step of length filtration
    - a directory length_filter_divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through the Divvier step of length filtration
    - a directory x_to_dash that contains the files:
      - {gene_name}.log - the log file for each gene run through the removal of X’s from gene files, as part of length filtration
    - a directory length_filter_bmge that contains the files:
      - {gene_name}.log - the log file for each gene run through the BMGE step of length filtration
    - a directory length_filtration that contains the files:
      - {gene_name}.log - the log file for the length filtration of each gene
    - a directory mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through MAFFT
    - a directory divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through Divvier
    - a directory trimal that contains the files:
      - {gene_name}.log - the log file for each gene run through trimAl
    - a directory raxml that contains the files:
      - {gene_name}.log - the log file for each gene run through RAxML
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
- a directory sgt_constructor_out_<M.D.Y>-local that contains:
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
  - metadata.tsv - the file that contains information about taxa already in the database
  - input_metadata.tsv - the input file that contains information about newly added taxa
  - tree_colors.tsv - the file used by forest.py in the next step to color taxa by taxonomic affiliation
- a file sgt_constructor_out_<M.D.Y>.tar.gz - a compressed file of the directory sgt_constructor_out_<M.D.Y>-local, created to ease the movement of all required data over to a local machine to render the svg files used in the next step. This is often necessary because headless servers lack graphics capabilities.
## - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.
$$ - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.
&& - These are standard RAxML-ng output files for each gene that PhyloFisher has appended the corresponding gene name to. See the RAxML-ng documentation for a thorough explanation of their contents.
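The output listing above can be turned into a quick completeness check. The sketch below is an assumption-laden illustration, not part of PhyloFisher: the helper missing_outputs and the EXPECTED table are hypothetical, only a subset of the listed files is checked, and the nesting of mafft, divvier, and BMGE under length_filtration is read from the listing rather than confirmed against the tool. A missing file does not necessarily indicate an error, since a gene may not pass through every stage.

```python
from pathlib import Path
from typing import Dict, List

# Per-gene files expected in each stage directory, transcribed from the
# listing above. Assumption: mafft, divvier and BMGE for the first pass
# live inside length_filtration.
EXPECTED: Dict[str, List[str]] = {
    "prequal": ["{g}.aa", "{g}.aa.filtered"],
    "length_filtration/mafft": ["{g}.aln"],
    "length_filtration/divvier": ["{g}.aln.divvy.fas"],
    "length_filtration/BMGE": ["{g}.pre_bmge", "{g}.bmge", "{g}.length_filtered"],
    "mafft": ["{g}.aln2"],
    "divvier": ["{g}.aln2.divvy.fas"],
    "trimAl": ["{g}.final"],
    "RAxML": ["{g}.raxml.bestTree", "{g}.raxml.support"],
}


def missing_outputs(run_dir: str, gene: str) -> List[str]:
    """Return the expected files for `gene` that are absent from run_dir."""
    root = Path(run_dir)
    missing = []
    for stage, templates in EXPECTED.items():
        for template in templates:
            rel = Path(stage) / template.format(g=gene)
            if not (root / rel).is_file():
                missing.append(rel.as_posix())
    return missing
```

Calling missing_outputs on an output directory and a gene name returns relative paths in pipeline order, so the first entry hints at the stage where processing of that gene stopped.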
NOTE: For a detailed explanation of the methodology implemented in sgt_constructor.py, see “Automated Filtering, Alignment, Trimming, and Gene Tree Construction.”
NOTE: If sgt_constructor.py dies in the middle of a run, simply provide the sgt_constructor_out_<M.D.Y> output directory from the previous run to sgt_constructor.py via the -o flag in addition to the previous command, and the script will pick up where it left off.
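A minimal sketch of the restart described in the note above, assuming the previous command is available as an argument list. The helper resume_command is hypothetical; it only re-emits the earlier arguments with -o pointing at the previous output directory and does not handle the --output=<dir> spelling.

```python
from typing import List


def resume_command(previous_cmd: List[str], previous_out_dir: str) -> List[str]:
    """Return previous_cmd with -o set to the earlier run's output directory."""
    cmd = []
    skip_next = False
    for arg in previous_cmd:
        if skip_next:
            # Drop the value that followed an earlier -o/--output flag.
            skip_next = False
            continue
        if arg in ("-o", "--output"):
            skip_next = True
            continue
        cmd.append(arg)
    return cmd + ["-o", previous_out_dir]
```

For a run started with -i and -t only, this appends -o followed by the existing sgt_constructor_out_<M.D.Y> directory and leaves the other arguments unchanged.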
NOTE: If sgt_constructor.py is submitted to a compute node without internet access, the creation of internal conda environments will fail. To circumvent this, run sgt_constructor.py from the command line on the head node until the conda environments are created. Then kill the process and submit sgt_constructor.py to a compute node.
NOTE: If sgt_constructor.py is bypassed in order to use alternative parameters for sequence filtering, alignment, and tree reconstruction, the following criteria must be met to re-enter the workflow and for downstream steps to perform correctly:
- Trees must have been built using a maximum likelihood program. Downstream quality control steps are not set up to interpret Bayesian posterior probability values.
- The directory structure of sgt_constructor_out_<M.D.Y>-local outlined above must be replicated. To do this:
  - Make a directory my_directory and, within my_directory, make a subdirectory called trees.
  - Tree files must follow the naming convention {nameofchoosing}.{gene_name}.tre and be located in the trees subdirectory.
  - Sequence files used to build the trees must follow the naming convention {gene_name}.final and be located in the trees subdirectory.
  - Sequence files used for length filtering must follow the naming convention {gene_name}.trimmed and be located in the trees subdirectory. THESE FILES ARE OPTIONAL.
  - The files metadata.tsv and tree_colors.tsv can be copied from PhyloFisherDatabase_v1.0/database/ and placed into my_directory along with the input_metadata.tsv file.
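The criteria above can be checked before moving on to the next step. The following sketch is illustrative only: check_layout is a hypothetical helper, not a PhyloFisher command, and it assumes gene names contain no dots, so that the gene name is the field immediately before the .tre extension. The optional .trimmed files are not checked.

```python
from pathlib import Path
from typing import List


def check_layout(root_dir: str) -> List[str]:
    """Return a list of problems in a hand-built directory (empty if none)."""
    root = Path(root_dir)
    trees = root / "trees"
    problems = []
    for name in ("metadata.tsv", "tree_colors.tsv", "input_metadata.tsv"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    if not trees.is_dir():
        problems.append("missing trees subdirectory")
        return problems
    tree_files = sorted(trees.glob("*.tre"))
    if not tree_files:
        problems.append("no .tre files in trees")
    for tree in tree_files:
        parts = tree.name.split(".")
        # Expect {nameofchoosing}.{gene_name}.tre, i.e. at least three fields.
        if len(parts) < 3:
            problems.append(f"{tree.name}: expected {{nameofchoosing}}.{{gene_name}}.tre")
            continue
        gene = parts[-2]  # assumption: gene names contain no dots
        if not (trees / f"{gene}.final").is_file():
            problems.append(f"missing {gene}.final for {tree.name}")
    return problems
```

An empty list means the directory matches the naming conventions listed above; it says nothing about whether the trees were built with a maximum likelihood program, which still has to be ensured by the user.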