Filter, align, and trim gene files followed by gene tree construction

Construct gene trees:

sgt_constructor.py [OPTIONS] -i <input_directory>

Required arguments:
- -i, --input <input_dir> Path to output of working_dataset_constructor.py
Optional arguments:
- -t, --threads <N> Number of threads
  - Default: 1
- --no_trees Do not build single gene trees
- --trees_only <in_dir> Only build single gene trees. No other operations performed
- -o, --output <out_dir> Path to user-defined output directory
  - Default: ./sgt_constructor_out_<M.D.Y>
- -h, --help Show this help message and exit
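
As a rough illustration of how the options above combine, the sketch below assembles two invocations as plain strings and prints them for review. The directory names working_dataset_constructor_out and my_sgt_out and the thread count of 8 are placeholders chosen for this example, not names or defaults that PhyloFisher produces.

```shell
# Placeholder values (assumptions, not PhyloFisher defaults); adjust to your run.
input_dir="working_dataset_constructor_out"   # output of working_dataset_constructor.py
out_dir="my_sgt_out"                          # user-defined output directory
threads=8

# Full run: filtering, alignment, trimming, and single gene trees.
full_run="sgt_constructor.py -i $input_dir -t $threads -o $out_dir"

# Same preprocessing, but without building single gene trees.
no_trees_run="sgt_constructor.py -i $input_dir -t $threads -o $out_dir --no_trees"

# Print the commands for review; paste one into the shell to launch it.
echo "$full_run"
echo "$no_trees_run"
```
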
Default sgt_constructor.py output:
- a directory sgt_constructor_out_<M.D.Y> that contains:
  - a directory prequal that contains the files:
    - {gene_name}.aa - unaligned gene file used as PREQUAL input
    - {gene_name}.aa.filtered - PREQUAL filtered sequence file used as input for MAFFT in subsequent length filtration step
    - {gene_name}.aa.filtered.PP - output of PREQUAL ##
    - {gene_name}.aa.warning - output of PREQUAL ##
  - a directory length_filtration that contains:
    - a directory mafft that contains the file:
      - {gene_name}.aln - aligned gene file in fasta format used as input for Divvier
    - a directory divvier that contains the files:
      - {gene_name}.aln.divvy.fas - Divvier filtered sequence file $$
      - {gene_name}.aln.PP - output of Divvier $$
    - a directory BMGE that contains the files:
      - {gene_name}.pre_bmge - modified version of “{gene_name}.aln.divvy.fas” with the character “X” replaced by the character “-”, used as input for BMGE
      - {gene_name}.bmge - output of BMGE and input for the length filtration step
      - {gene_name}.length_filtered - length filtered fasta file used as input for a second run of MAFFT
  - a directory mafft that contains the file:
    - {gene_name}.aln2 - output of the second run of MAFFT and input for a second run of Divvier
  - a directory divvier that contains the files:
    - {gene_name}.aln2.divvy.fas - output of the second run of Divvier and input for trimAl
    - {gene_name}.aln2.PP - output of Divvier $$
  - a directory trimAl that contains the files:
    - {gene_name}.final - trimmed alignment in fasta format and input for RAxML
    - {gene_name}.final.reduced - trimmed alignment in phylip format
  - a directory RAxML that contains the files:
    - {gene_name}.raxml.bestTreeCollapsed &&
    - {gene_name}.raxml.rba &&
    - {gene_name}.raxml.startTree &&
    - {gene_name}.raxml.bestTree &&
    - {gene_name}.raxml.mlTrees &&
    - {gene_name}.raxml.support &&
    - {gene_name}.raxml.bestModel &&
    - {gene_name}.raxml.bootstraps &&
    - {gene_name}.raxml.log &&
  - a directory logs that contains:
    - a directory prequal that contains the files:
      - {gene_name}.log - the log file for each gene run through PREQUAL
    - a directory length_filter_mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through the MAFFT step of length filtration
    - a directory length_filter_divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through the Divvier step of length filtration
    - a directory x_to_dash that contains the files:
      - {gene_name}.log - the log file for each gene run through removal of X’s from gene files, as part of length filtration
    - a directory length_filter_bmge that contains the files:
      - {gene_name}.log - the log file for each gene run through the BMGE step of length filtration
    - a directory length_filtration that contains the files:
      - {gene_name}.log - the log file for the length filtration of each gene
    - a directory mafft that contains the files:
      - {gene_name}.log - the log file for each gene run through MAFFT
    - a directory divvier that contains the files:
      - {gene_name}.log - the log file for each gene run through Divvier
    - a directory trimal that contains the files:
      - {gene_name}.log - the log file for each gene run through trimAl
    - a directory raxml that contains the files:
      - {gene_name}.log - the log file for each gene run through RAxML
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
- a directory sgt_constructor_out_<M.D.Y>-local that contains:
  - a directory trees that contains:
    - {gene_name}.trimmed - the trimmed alignment file used for length filtration of each gene
    - {gene_name}.final - the trimmed alignment file used for tree construction by RAxML
    - RAxML_bipartitions.{gene_name}.tre - the final bootstrapped ML tree for each gene
  - metadata.tsv - the file that contains information about taxa already in the database
  - input_metadata.tsv - the input file that contains information about newly added taxa
  - tree_colors.tsv - the file used by forest.py in the next step to color taxa by taxonomic affiliation
- a file sgt_constructor_out_<M.D.Y>.tar.gz - a compressed file of the directory sgt_constructor_out_<M.D.Y>-local created to ease the movement of all required data over to a local machine to render the svg files used in the next step. This is often necessary due to the lack of graphics capabilities of headless servers.

## - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.

$$ - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.

&& - These are standard RAxML-ng output files for each gene that PhyloFisher has appended the corresponding gene name to. See the RAxML-ng documentation for a thorough explanation of their contents.
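
Once the sgt_constructor_out_<M.D.Y>.tar.gz archive listed above has been copied to a local machine, it can be unpacked with standard tar. The following is a minimal sketch; the function name unpack_sgt_bundle and its optional destination argument are illustrative assumptions and not part of PhyloFisher.

```shell
# unpack_sgt_bundle ARCHIVE [DEST]
# Extract a sgt_constructor_out_<M.D.Y>.tar.gz archive into DEST
# (default: the current directory). Hypothetical helper, not a PhyloFisher command.
unpack_sgt_bundle() {
    archive=$1
    dest=${2:-.}
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Usage (replace the placeholder with the name of your own archive):
#   unpack_sgt_bundle <archive>.tar.gz "$HOME/phylofisher_local"
```
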

NOTE: For a detailed explanation of the methodology implemented in sgt_constructor.py see “Automated Filtering, Alignment, Trimming, and Gene Tree Construction.”

NOTE: If sgt_constructor.py dies in the middle of a run, simply provide the sgt_constructor_out_<M.D.Y> output directory from the previous run to sgt_constructor.py via the -o flag in addition to the previous command and the script will pick up where it left off.
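
The resume step described above amounts to repeating the original command with -o pointing at the earlier output directory. A minimal sketch follows, assuming the original command did not already include -o; the helper name resume_sgt_cmd is hypothetical, and it only prints the command rather than running it.

```shell
# resume_sgt_cmd PREV_OUT_DIR ORIGINAL_ARGS...
# Print the original command with "-o PREV_OUT_DIR" appended; nothing is executed.
# Hypothetical helper, not part of PhyloFisher.
resume_sgt_cmd() {
    prev_out=$1
    shift
    [ -d "$prev_out" ] || { echo "previous output directory not found: $prev_out" >&2; return 1; }
    echo "sgt_constructor.py $* -o $prev_out"
}

# Usage (substitute the output directory of the interrupted run):
#   resume_sgt_cmd <previous_output_dir> -i <input_dir> -t 8
```
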

NOTE: If sgt_constructor.py is submitted to a compute node without internet access, the creation of internal conda environments will fail. To circumvent this, run sgt_constructor.py from the command line on the head node until the conda environments are created. Then kill the process and submit sgt_constructor.py to a compute node.

NOTE: If sgt_constructor.py is bypassed in order to use alternative parameters for sequence filtering, alignment, and tree reconstruction, the following criteria must be met to re-enter the workflow and for downstream steps to perform correctly:

Trees must have been built using a maximum likelihood program. Downstream quality control steps are not set up to interpret Bayesian posterior probability values.
The directory structure of sgt_constructor_out_<M.D.Y>-local outlined above must be replicated. To do this:
1. Make a directory my_directory and, within it, a subdirectory called trees.
2. Tree files must follow the naming convention {nameofchoosing}.{gene_name}.tre and be located in the trees subdirectory.
3. Sequence files used to build the trees must follow the naming convention {gene_name}.final and be located in the trees subdirectory.
4. Sequence files used for length filtering must follow the naming convention {gene_name}.trimmed and be located in the trees subdirectory. THESE FILES ARE OPTIONAL.
5. The files metadata.tsv and tree_colors.tsv can be copied from PhyloFisherDatabase_v1.0/database/ and placed into my_directory along with the input_metadata.tsv file.
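
The steps above can be sketched as two small shell helpers: one that creates the directory skeleton and copies the three .tsv files, and one that checks the naming conventions before moving on to the next step. The function names make_local_dir and check_local_dir are hypothetical, and the check assumes that neither {nameofchoosing} nor {gene_name} contains a dot.

```shell
# make_local_dir TARGET DB_DIR INPUT_METADATA
# Create TARGET/trees and copy metadata.tsv, tree_colors.tsv, and
# input_metadata.tsv into TARGET. DB_DIR would be e.g.
# PhyloFisherDatabase_v1.0/database. Hypothetical helper.
make_local_dir() {
    target=$1
    db_dir=$2
    input_meta=$3
    mkdir -p "$target/trees" || return 1
    cp "$db_dir/metadata.tsv" "$db_dir/tree_colors.tsv" "$target/" || return 1
    cp "$input_meta" "$target/input_metadata.tsv"
}

# check_local_dir TARGET
# Return 0 if the three .tsv files exist and every
# {nameofchoosing}.{gene_name}.tre has a matching {gene_name}.final.
# The optional {gene_name}.trimmed files are not checked.
check_local_dir() {
    target=$1
    status=0
    for f in metadata.tsv tree_colors.tsv input_metadata.tsv; do
        [ -f "$target/$f" ] || { echo "missing $f" >&2; status=1; }
    done
    for tree in "$target"/trees/*.tre; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$tree" ] || { echo "no .tre files in $target/trees" >&2; status=1; break; }
        base=$(basename "$tree" .tre)   # {nameofchoosing}.{gene_name}
        gene=${base#*.}                 # drop everything up to the first dot
        [ -f "$target/trees/$gene.final" ] || { echo "missing $gene.final" >&2; status=1; }
    done
    return $status
}

# Usage:
#   make_local_dir my_directory PhyloFisherDatabase_v1.0/database input_metadata.tsv
#   (copy your tree and alignment files into my_directory/trees, then)
#   check_local_dir my_directory
```
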