fast_taxa_remover.py
Remove the fastest evolving taxa from a matrix based on branch length.
fast_taxa_remover.py [OPTIONS] -m <input_matrix> -tr <input_tree>
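The description above says only that taxa are ranked "based on branch length"; it does not say which measure is used. As a rough illustration, the sketch below assumes root-to-tip distance (the sum of branch lengths from the root to each tip) as the measure. The tuple-based tree layout and the names `root_to_tip` and `fastest_taxa` are hypothetical and are not part of PhyloFisher.

```python
# Hypothetical toy tree: each node is (label, branch_length, children).
# Tips have an empty children list; internal nodes use None as label.
def root_to_tip(node, dist=0.0, out=None):
    """Return {taxon: summed branch length from the root to that tip}."""
    if out is None:
        out = {}
    label, length, children = node
    dist += length
    if not children:
        out[label] = dist
    for child in children:
        root_to_tip(child, dist, out)
    return out


def fastest_taxa(tree, n):
    """Return the n taxa with the longest root-to-tip distances (assumed measure)."""
    dists = root_to_tip(tree)
    return sorted(dists, key=dists.get, reverse=True)[:n]


tree = (None, 0.0, [
    ("TaxonA", 0.10, []),
    (None, 0.05, [
        ("TaxonB", 0.90, []),
        ("TaxonC", 0.20, []),
    ]),
])
print(fastest_taxa(tree, 1))
```

In this toy tree `TaxonB` sits on the longest path from the root, so it is the first candidate for removal.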
Required arguments:
-m,--matrix <matrix.fas|nex|phy> Path to input matrix.
-tr,--tree <tree> Path to input tree.
-i,--iterations <N> Number of iterations.
-or,--ortholog_files Path to directory containing the individual ortholog files. This will be the path to prep_final_dataset_ if used within the main PhyloFisher workflow.
Optional arguments:
-in_format <format> Input matrix format.
  - Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
  - Default: fasta
-out_format <format> Desired output format.
  - Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
  - Default: fasta
-s,--step_size <N> Number of taxa removed per iteration.
  - Default: 1
-o,--output <out_dir> Path to user-defined output directory.
  - Default: ./fast_taxa_removal_out_<M.D.Y>
-t,--threads <N> Desired number of threads to be utilized.
  - Default: 1
-h,--help Show this help message and exit.
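The interaction between `-i,--iterations` and `-s,--step_size` can be sketched as follows. This is an illustration only: it assumes removal is cumulative (iteration k removes the k x step_size fastest taxa from a ranking computed once), which the help text does not confirm. The function name `removal_schedule` and the taxon names are hypothetical.

```python
def removal_schedule(ranked_taxa, iterations, step_size=1):
    """Return a list of (iteration, taxa_removed_so_far) pairs.

    ranked_taxa is assumed to be ordered fastest-evolving first.
    Cumulative removal is an assumption, not documented behaviour.
    """
    schedule = []
    for i in range(1, iterations + 1):
        removed = ranked_taxa[: i * step_size]
        schedule.append((i, removed))
    return schedule


ranked = ["TaxonB", "TaxonC", "TaxonA", "TaxonD"]
for i, removed in removal_schedule(ranked, iterations=2, step_size=2):
    print(i, removed)
```

Under these assumptions, `-i 2 -s 2` would produce two datasets: one with the two fastest taxa removed and one with the four fastest removed.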
fast_taxa_remover.py output:
- a directory called ./fast_taxa_remover_out_<M.D.Y> with sub-directories: steps_<N> (N = step size) that contain:
  - a directory prequal that contains:
    - {gene_name}.aa - unaligned gene file used as PREQUAL input.
    - {gene_name}.aa.filtered - output of PREQUAL. Used as input for MAFFT in the subsequent length filtration step.
    - {gene_name}.aa.filtered.PP - output of PREQUAL.&
    - {gene_name}.aa.warning - output of PREQUAL.&
  - a directory mafft that contains:
    - {gene_name}.aln - output of MAFFT and input for Divvier.
  - a directory divvier that contains:
    - {gene_name}.aln.partial.fas - output of Divvier and input for trimAl.
    - {gene_name}.aln.PP - output of Divvier.&&
  - a directory trimAl that contains:
    - {gene_name}.gt80trimAl - output of trimAl. Trimmed alignments that will be used for concatenation.
  - indices.tsv - a tab separated file with three columns outlining the single gene boundaries in the supermatrix. These columns are:
    - Gene - name of the gene.
    - Start - first position of the gene within the supermatrix.
    - Stop - last position of the gene within the supermatrix.
  - matrix.<fas|nex|phy> - the concatenated supermatrix of all genes in the provided input directory in the specified file format.
  - matrix_constructor_stats.tsv - a tab separated file with two columns:
    - Taxon - unique ID of taxon in the database.
    - Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.
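The Percent Missing Data value can be thought of as the share of sites in a taxon's row of the supermatrix that hold no residue. The sketch below is one way to compute such a figure. Which symbols count as "unoccupied" is an assumption here (`-` and `?`), and the function name and example sequences are hypothetical.

```python
def percent_missing(sequence, missing_chars="-?"):
    """Percentage of sites in one supermatrix row that hold a missing-data symbol.

    The default set of missing symbols is an assumption, not PhyloFisher's definition.
    """
    if not sequence:
        return 0.0
    missing = sum(1 for site in sequence if site in missing_chars)
    return 100.0 * missing / len(sequence)


# Made-up rows, for illustration only.
matrix = {"TaxonA": "MKTL-AVGQ-", "TaxonB": "----------"}
for taxon, seq in matrix.items():
    print(taxon, round(percent_missing(seq), 1))
```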
& - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.
&& - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.
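The Gene, Start and Stop columns of indices.tsv can be used to cut a single gene back out of the supermatrix. The sketch below assumes the file has a header row with exactly those column names and that positions are 1-based and inclusive; neither detail is confirmed by the text above. The file contents and the sequence are made up for illustration.

```python
import csv
import io


def read_indices(handle):
    """Parse Gene/Start/Stop rows into {gene: (start, stop)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Gene"]: (int(row["Start"]), int(row["Stop"])) for row in reader}


def slice_gene(sequence, start, stop):
    """Cut one gene out of a supermatrix row (1-based, inclusive: an assumption)."""
    return sequence[start - 1 : stop]


# Hypothetical file contents and sequence, for illustration only.
indices_text = "Gene\tStart\tStop\ngeneA\t1\t4\ngeneB\t5\t9\n"
indices = read_indices(io.StringIO(indices_text))
row = "MKTL-AVGQ"
print(slice_gene(row, *indices["geneB"]))
```

If the real file turns out to use 0-based or half-open coordinates, only `slice_gene` needs to change.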