Matrix Construction
Trim, align, and concatenate orthologs into a super-matrix.
To do this, run:
matrix_constructor.py [OPTIONS] -i <input_directory>
Required arguments:
-i, --input <in_dir>  Path to prep_final_dataset_<M.D.Y>
Optional arguments:
-f, --out_format <format>  Desired format of the output matrix.
  - Options: fasta, phylip, phylip-relaxed, or nexus.
  - Default: fasta
-if, --in_format <format>  Format of the input files.
  - Options: fasta, phylip, phylip-relaxed, or nexus.
  - Default: fasta
-c, --concatenation_only  Only concatenate alignments. Filtering, alignment, and trimming are not performed automatically.
-t, --threads <N>  Desired number of threads to be utilized.
  - Default: 1
-o, --output <out_dir>  Path to user-defined output directory.
  - Default: ./matrix_constructor_out_<M.D.Y>
-p, --prefix <prefix>  Prefix of input files.
  - Default: None
  - Example: path/to/input/prefix*
-s, --suffix <suffix>  Suffix of input files.
  - Default: None
  - Example: path/to/input/*suffix
-h, --help  Show this help message and exit.
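As an illustration of how the options above combine, here is a minimal Python sketch that assembles a matrix_constructor.py command line without running it. The helper name build_command and the input path are hypothetical; only the flags and format names come from the option list above.

```python
import shlex

# Format names taken from the option list above.
FORMATS = {"fasta", "phylip", "phylip-relaxed", "nexus"}


def build_command(input_dir, out_format="fasta", in_format="fasta",
                  threads=1, output_dir=None, concatenation_only=False):
    """Assemble (but do not run) a matrix_constructor.py argument list.

    build_command is a hypothetical helper, not part of PhyloFisher.
    """
    for fmt in (out_format, in_format):
        if fmt not in FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
    cmd = ["matrix_constructor.py", "-i", str(input_dir),
           "-f", out_format, "-if", in_format, "-t", str(threads)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if concatenation_only:
        cmd.append("-c")  # skip filtering, alignment, and trimming
    return cmd


# Hypothetical input path; substitute your own prep_final_dataset directory.
cmd = build_command("path/to/prep_final_dataset", out_format="nexus", threads=4)
command_line = shlex.join(cmd)  # shell-safe string for logging or copy-paste
print(command_line)
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where PhyloFisher is installed.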
Default matrix_constructor.py output:
- a directory called matrix_constructor_out_<M.D.Y> containing:
  - a directory prequal that contains:
    - {gene_name}.aa - unaligned gene file used as PREQUAL input.
    - {gene_name}.aa.filtered - output of PREQUAL. Used as input for MAFFT in subsequent length filtration step.
    - {gene_name}.aa.filtered.PP - output of PREQUAL. &&
    - {gene_name}.aa.warning - output of PREQUAL. &&
  - a directory mafft that contains:
    - {gene_name}.aln - output of MAFFT and input for Divvier.
    - {gene_name}.aln.PP - output of PREQUAL. &&
  - a directory divvier that contains:
    - {gene_name}.aln.partial.fas - output of Divvier and input for trimAl.
  - a directory trimAl that contains:
    - {gene_name}.gt80trimAl.fas - output of trimAl. Trimmed alignments, in FASTA format, that will be used for concatenation and can be used to compute single ortholog trees for use in coalescent-based analyses.
  - indices.tsv - a tab-separated file with three columns outlining the single gene boundaries in the super matrix:
    - Gene - name of the gene
    - Start - first position of the gene within the super matrix
    - Stop - last position of the gene within the super matrix
  - matrix.<fas | nex | phy> - the concatenated super matrix of all genes in the provided input directory, in the specified file format.
  - matrix_constructor_stats.tsv - a tab-separated file with two columns:
    - Taxon - unique ID of taxon in the database
    - Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.
&& - These are standard PREQUAL output files for each gene, to which PhyloFisher has appended the corresponding gene name. See the PREQUAL documentation for a thorough explanation of their contents.
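To make the indices.tsv and matrix_constructor_stats.tsv descriptions concrete, the following Python sketch recovers a single gene from one row of a concatenated matrix and computes a percent-missing value. It rests on several assumptions not stated above: that indices.tsv has a header row, that Start and Stop are 1-based and inclusive, and that "-" and "?" mark unoccupied sites. The function names and toy data are hypothetical, and this is not necessarily how PhyloFisher computes its own statistics.

```python
import csv
import io


def read_indices(handle, has_header=True):
    """Return {gene: (start, stop)} from an indices.tsv-style handle."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed Gene/Start/Stop header row
    return {row[0]: (int(row[1]), int(row[2])) for row in reader if row}


def slice_gene(sequence, start, stop):
    """Extract one gene's columns, assuming 1-based inclusive coordinates."""
    return sequence[start - 1:stop]


def percent_missing(sequence, missing_chars="-?"):
    """Percentage of sites that are unoccupied (assumed to be '-' or '?')."""
    if not sequence:
        return 0.0
    missing = sum(sequence.count(char) for char in missing_chars)
    return 100.0 * missing / len(sequence)


# Toy data standing in for indices.tsv and one taxon's row of the matrix.
toy_indices = "Gene\tStart\tStop\ngeneA\t1\t4\ngeneB\t5\t10\n"
toy_row = "MKV-AAL---"

indices = read_indices(io.StringIO(toy_indices))
gene_a = slice_gene(toy_row, *indices["geneA"])
gene_b = slice_gene(toy_row, *indices["geneB"])
overall_missing = percent_missing(toy_row)

print(gene_a, gene_b, overall_missing)
```

With real output, open indices.tsv with open(path, newline="") in place of the io.StringIO wrapper, and check the first line of the file to confirm whether a header row is present.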