nucl_matrix_constructor.py
Create nucleotide super-matrix from orthologs
nucl_matrix_constructor.py [OPTIONS]
Required arguments:
-i, --input <in_dir>  Path to prep_final_dataset_<M.D.Y>
Optional arguments:
-of, --out_format <format>  Desired format of the output matrix.
  - Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
  - Default: fasta
-t, --threads N  Desired number of threads to be utilized.
  - Default: 1
--clean_up  Clean up large intermediate files.
-o, --output <out_dir>  Path to user-defined output directory.
  - Default: ./nucl_matrix_constructor_out_<M.D.Y>
-h, --help  Show this help message and exit.
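
As a rough illustration of how the options above combine, the following Python sketch assembles a command line from them. The helper name build_command is hypothetical and not part of the tool; the flags and format names are taken from the option list above, and the sketch assumes nucl_matrix_constructor.py is reachable on your PATH if you actually run the result.

```python
import shlex
from typing import List, Optional

# Output formats listed in the option description above.
ALLOWED_FORMATS = {"fasta", "phylip", "phylip-relaxed", "nexus"}


def build_command(in_dir: str,
                  out_format: str = "fasta",
                  threads: int = 1,
                  clean_up: bool = False,
                  out_dir: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build an argument list from the documented options."""
    if out_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {out_format}")
    cmd = ["nucl_matrix_constructor.py",
           "--input", in_dir,
           "--out_format", out_format,
           "--threads", str(threads)]
    if clean_up:
        cmd.append("--clean_up")
    if out_dir is not None:
        # Without --output the tool falls back to its dated default directory.
        cmd += ["--output", out_dir]
    return cmd


# The input path keeps the <M.D.Y> placeholder; substitute your real directory.
cmd = build_command("prep_final_dataset_<M.D.Y>",
                    out_format="phylip-relaxed", threads=4, clean_up=True)
print(shlex.join(cmd))  # shell-quoted string you could paste into a terminal
```

The list form can be handed to subprocess.run(cmd, check=True) directly, which avoids shell quoting issues with unusual directory names.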
Default nucl_matrix_constructor.py output:
- a directory nucl_matrix_constructor_out_<M.D.Y> that contains:
  - a directory cds that contains:
    - {unique_id}.fas - a copy of the input nucleotide FASTA file with headers renamed.
    - {unique_id}.fas.nin - an output file from makeblastdb
    - {unique_id}.fas.nhr - an output file from makeblastdb
    - {unique_id}.fas.nsq - an output file from makeblastdb
  - a directory tblastn that contains:
    - {gene_name}.{unique_id}.tsv - a tab-separated file with the top tblastn hit.
  - a directory nucl_seqs that contains:
    - {gene_name}.fas - a FASTA file with the nucleotide sequences of the top tblastn hits.
  - a directory logs that contains:
    - a directory makeblastdb that contains:
      - {unique_id}.log - the log file from makeblastdb
    - a directory tblastn that contains:
      - {gene_name}.{unique_id}.log - the log file from tblastn
    - a directory mafft that contains:
      - {gene_name}.log - the log file from mafft
    - a directory trimal that contains:
      - {gene_name}.log - the log file from trimal
  - a directory mafft that contains:
    - {gene_name}.aln - output of MAFFT
  - a directory trimal that contains:
    - {gene_name}.final - output of trimAl. Trimmed alignments, in FASTA format, that will be used for concatenation.
  - indices.tsv - a tab-separated file with three columns outlining the single-gene boundaries in the supermatrix:
    - Gene - name of the gene
    - Start - first position of the gene within the supermatrix
    - Stop - last position of the gene within the supermatrix
  - occupancy.tsv - a tab-separated file with a column for each gene and a row for each taxon. It records the presence (1) or absence (0) of each taxon for each gene.
  - matrix.<fas | nex | phy> - the concatenated supermatrix of all genes in the provided input directory, in the specified file format
  - matrix_constructor_stats.tsv - a tab-separated file with two columns:
    - Taxon - unique ID of the taxon in the database
    - Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.
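
The Gene/Start/Stop boundaries in indices.tsv can be used to cut a single gene back out of a concatenated sequence. The sketch below is only an illustration under two assumptions that the description above does not state: that the file has a header row with exactly the column names Gene, Start and Stop, and that positions are 1-based and inclusive. The example table and sequence are made up, and the function names are hypothetical.

```python
import csv
import io

# Made-up example content standing in for a real indices.tsv.
INDICES_TSV = "Gene\tStart\tStop\ngeneA\t1\t6\ngeneB\t7\t10\n"


def read_indices(handle):
    """Return {gene: (start, stop)} from a Gene/Start/Stop table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Gene"]: (int(row["Start"]), int(row["Stop"])) for row in reader}


def slice_gene(sequence, start, stop):
    """Extract one gene from a concatenated sequence.

    Assumes 1-based, inclusive coordinates (an assumption, not confirmed above).
    """
    return sequence[start - 1:stop]


boundaries = read_indices(io.StringIO(INDICES_TSV))

# One made-up taxon row of the supermatrix: geneA (6 sites) followed by geneB (4 sites).
taxon_row = "ATGCCA" + "GG-T"
gene_b = slice_gene(taxon_row, *boundaries["geneB"])
print(gene_b)  # GG-T
```

For a real run, replace the in-memory string with open("indices.tsv", newline="") and check the coordinate convention against a gene of known length before relying on the slices.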
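
The occupancy table described above (one row per taxon, one column per gene, 1 for present and 0 for absent) lends itself to a quick per-taxon summary. The following sketch counts how many genes each taxon has. It assumes a header row whose first cell labels the taxon column (the label "Taxon" here is a guess) and uses made-up data. Note that this is a gene-level count, which is not the same quantity as the site-level Percent Missing Data reported in matrix_constructor_stats.tsv.

```python
import csv
import io

# Made-up example content standing in for a real occupancy.tsv.
OCCUPANCY_TSV = (
    "Taxon\tgeneA\tgeneB\tgeneC\n"
    "taxon1\t1\t1\t0\n"
    "taxon2\t1\t0\t0\n"
)


def genes_present(handle):
    """Return {taxon: (genes_present, total_genes)} from a 1/0 occupancy table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    total_genes = len(header) - 1  # first column is assumed to hold the taxon ID
    summary = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        taxon, flags = row[0], row[1:]
        summary[taxon] = (sum(1 for flag in flags if flag == "1"), total_genes)
    return summary


summary = genes_present(io.StringIO(OCCUPANCY_TSV))
for taxon, (present, total) in summary.items():
    print(f"{taxon}: {present} of {total} genes present")
```

A summary like this can help spot taxa with sparse gene coverage before deciding whether to keep them in the final matrix.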