fast_taxa_remover.py

Remove the fastest evolving taxa from a matrix based on branch length.

fast_taxa_remover.py [OPTIONS] -m <input_matrix> -tr <input_tree>

Required arguments:

  • -m, --matrix <matrix.fas|nex|phy> Path to input matrix.
  • -tr, --tree <tree> Path to input tree.
  • -i, --iterations <N> Number of iterations.
  • ` -or, –ortholog_files Path to directory containing the individual ortholog files. This will be the path to prep_final_dataset_` if used within the main PhyloFisher workflow.

Optional arguments:

  • -in_format <format> Input matrix format.
    • Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
    • Default: fasta
  • -out_format <format> Desired output format.
    • Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
    • Default: fasta
  • -s, --step_size <N> Number taxa removed per iteration.
    • Default: 1
  • -o, --output <out_dir> Path to user-defined output directory.
    • Default: ./fast_taxa_removal_out_<M.D.Y>
  • -t, --threads <N> Desired number of threads to be utilized.
    • Default: 1
  • -h, - Show this help message and exit.

fast_taxa_remover.py output:

  • a directory called ./fast_taxa_remover_out_<M.D.Y> with sub-directories:
    • steps_<N> (N=step size) that contain:
  • a directory prequal that contains:
    • {gene_name}.aa - unaligned gene file used as PREQUAL input.
    • {gene_name}.aa.filtered - output of PREQUAL. Used as input for MAFFT in subsequent length filtration step.
    • {gene_name}.aa.filtered.PP - output of PREQUAL.&
    • {gene_name}.aa.warning - output of PREQUAL.&
  • a directory mafft that contains:
    • {gene_name}.aln - output of MAFFT and input for Divvier.
  • a directory divvier that contains:
    • {gene_name}.aln.partial.fas - output of Divvier and input for timAl.
    • {gene_name}.aln.PP - output of Divvier.&&
  • a directory trimAl that contains:
    • {gene_name}.gt80trimAl - output of trimAl. Trimmed alignments that will be used for concatenation.
  • indices.tsv - a tab separated file with three columns outlining the single gene boundaries in the supermatrix. These columns are:
    1. Gene - name of the gene.
    2. Start - first position of the gene within the super matrix.
    3. Stop - last position of the gene within the super matrix.
  • matrix.<fas|nex|phy> - the concatenated super matrix of all genes in the provided input directory in the specified file format.
  • matrix_constructor_stats.tsv - a tab separated file with two columns:
    1. Taxon -Unique ID of taxon in the database.
    2. Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.

& - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.

&& - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.