fast_taxa_remover.py
Remove the fastest evolving taxa from a matrix based on branch length.
fast_taxa_remover.py [OPTIONS] -m <input_matrix> -tr <input_tree>
Required arguments:
-m
,--matrix <matrix.fas|nex|phy>
Path to input matrix.-tr
,--tree <tree>
Path to input tree.-i
,--iterations <N>
Number of iterations.- ` -or
,
–ortholog_filesPath to directory containing the individual ortholog files. This will be the path to
prep_final_dataset_` if used within the main PhyloFisher workflow.
Optional arguments:
-in_format <format>
Input matrix format.- Options:
fasta
,phylip
(names truncated at 10 characters),phylip-relaxed
(names are not truncated), ornexus
. - Default:
fasta
- Options:
-out_format <format>
Desired output format.- Options:
fasta
,phylip
(names truncated at 10 characters),phylip-relaxed
(names are not truncated), ornexus
. - Default:
fasta
- Options:
-s
,--step_size <N>
Number taxa removed per iteration.- Default: 1
-o
,--output <out_dir>
Path to user-defined output directory.- Default:
./fast_taxa_removal_out_<M.D.Y>
- Default:
-t
,--threads <N>
Desired number of threads to be utilized.- Default:
1
- Default:
-h
, - Show this help message and exit.
fast_taxa_remover.py
output:
- a directory called
./fast_taxa_remover_out_<M.D.Y>
with sub-directories:steps_<N>
(N=step size) that contain:
- a directory
prequal
that contains:{gene_name}.aa
- unaligned gene file used as PREQUAL input.{gene_name}.aa.filtered
- output of PREQUAL. Used as input for MAFFT in subsequent length filtration step.{gene_name}.aa.filtered.PP
- output of PREQUAL.&{gene_name}.aa.warning
- output of PREQUAL.&
- a directory
mafft
that contains:{gene_name}.aln
- output of MAFFT and input for Divvier.
- a directory
divvier
that contains:{gene_name}.aln.partial.fas
- output of Divvier and input for timAl.{gene_name}.aln.PP
- output of Divvier.&&
- a directory
trimAl
that contains:{gene_name}.gt80trimAl
- output of trimAl. Trimmed alignments that will be used for concatenation.
indices.tsv
- a tab separated file with three columns outlining the single gene boundaries in the supermatrix. These columns are:- Gene - name of the gene.
- Start - first position of the gene within the super matrix.
- Stop - last position of the gene within the super matrix.
matrix.<fas|nex|phy>
- the concatenated super matrix of all genes in the provided input directory in the specified file format.matrix_constructor_stats.tsv
- a tab separated file with two columns:- Taxon -Unique ID of taxon in the database.
- Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.
& - These are standard PREQUAL output files for each gene that PhyloFisher has appended the corresponding gene name to. See the PREQUAL documentation for a thorough explanation of their contents.
&& - These are standard Divvier output files for each gene that PhyloFisher has appended the corresponding gene name to. See the Divvier documentation for a thorough explanation of their contents.