`nucl_matrix_constructor.py`

Create a nucleotide supermatrix from orthologs.

nucl_matrix_constructor.py [OPTIONS]

Required arguments:

-i, --input <in_dir> Path to the prep_final_dataset_<M.D.Y> directory

Optional arguments:

-of, --out_format <format> Desired format of the output matrix.
- Options: fasta, phylip (names truncated at 10 characters), phylip-relaxed (names are not truncated), or nexus.
- Default: fasta
-t, --threads N Number of threads to use.
- Default: 1
--clean_up Clean up large intermediate files.
-o, --output <out_dir> Path to user-defined output directory
- Default: ./nucl_matrix_constructor_out_<M.D.Y>
-h, --help Show this help message and exit
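
The options above can be combined into a single invocation. The following is a minimal sketch that only assembles the argument list from the documented flags. It assumes the script sits in the current working directory and is launched with the running Python interpreter; `build_command` is a hypothetical helper, not part of the tool, and the input path is the placeholder from the help text.

```python
import shlex
import sys


def build_command(in_dir, out_format="fasta", threads=1, out_dir=None, clean_up=False):
    """Assemble an argv list using only the flags documented above (hypothetical helper)."""
    cmd = [
        sys.executable,
        "nucl_matrix_constructor.py",  # assumed to be in the current directory
        "--input", in_dir,
        "--out_format", out_format,
        "--threads", str(threads),
    ]
    if out_dir is not None:
        cmd += ["--output", out_dir]
    if clean_up:
        cmd.append("--clean_up")
    return cmd


# Placeholder input directory; substitute the real prep_final_dataset_<M.D.Y> path.
cmd = build_command(
    "prep_final_dataset_<M.D.Y>",
    out_format="phylip-relaxed",
    threads=4,
    clean_up=True,
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the script, provided its external dependencies are installed.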

Default nucl_matrix_constructor.py output:

a directory nucl_matrix_constructor_out_<M.D.Y> that contains:
- a directory cds that contains:
  - {unique_id}.fas - a copy of the input nucleotide fasta file with headers renamed.
  - {unique_id}.fas.nin - an output file from makeblastdb
  - {unique_id}.fas.nhr - an output file from makeblastdb
  - {unique_id}.fas.nsq - an output file from makeblastdb
- a directory tblastn that contains:
  - {gene_name}.{unique_id}.tsv - a tab-separated file with the top tblastn hit.
- a directory nucl_seqs that contains:
  - {gene_name}.fas - a FASTA file with the nucleotide sequences of the top tblastn hits.
- a directory logs that contains:
  - a directory makeblastdb that contains:
    - {unique_id}.log - the log file from makeblastdb
  - a directory tblastn that contains:
    - {gene_name}.{unique_id}.log - the log file from tblastn
  - a directory mafft that contains:
    - {gene_name}.log - the log file from mafft
  - a directory trimal that contains:
    - {gene_name}.log - the log file from trimal
- a directory mafft that contains:
  - {gene_name}.aln - output of MAFFT
- a directory trimal that contains:
  - {gene_name}.final - output of trimAl: trimmed alignments, in FASTA format, that will be used for concatenation.
- indices.tsv - a tab-separated file with three columns outlining the single-gene boundaries in the supermatrix:
  1. Gene - name of the gene
  2. Start - first position of the gene within the supermatrix
  3. Stop - last position of the gene within the supermatrix
- occupancy.tsv - a tab-separated file with a column for each gene and a row for each taxon. Each cell records whether the taxon is present (1) or absent (0) for that gene.
- matrix.<fas | nex | phy> - the concatenated supermatrix of all genes in the provided input directory, in the specified file format.
- matrix_constructor_stats.tsv - a tab-separated file with two columns:
  1. Taxon - Unique ID of taxon in the database
  2. Percent Missing Data - percentage of unoccupied sites within the concatenated matrix.
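
To illustrate how indices.tsv and the matrix relate, here is a minimal sketch that slices one gene out of a FASTA supermatrix and computes a per-taxon percentage of missing data. It rests on several assumptions that the description above does not confirm: that indices.tsv has a header row named Gene, Start, Stop; that positions are 1-based and inclusive; and that "unoccupied" sites are gaps (`-`) or `?`. The helper names and the toy data are hypothetical, and the script's own calculation may differ.

```python
import csv
import io


def read_indices(handle):
    """Parse gene boundaries; assumes a Gene/Start/Stop header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Gene"]: (int(row["Start"]), int(row["Stop"])) for row in reader}


def read_fasta(handle):
    """Return {header: sequence} from a FASTA handle."""
    chunks, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {taxon: "".join(parts) for taxon, parts in chunks.items()}


def slice_gene(matrix, boundaries, gene):
    """Extract one gene per taxon; assumes 1-based, inclusive coordinates."""
    start, stop = boundaries[gene]
    return {taxon: seq[start - 1:stop] for taxon, seq in matrix.items()}


def percent_missing(seq, missing="-?"):
    """Percentage of sites that are gap or '?' characters (assumed definition)."""
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(char) for char in missing) / len(seq)


# Toy stand-ins for indices.tsv and matrix.fas (invented for illustration).
indices_text = "Gene\tStart\tStop\ngeneA\t1\t6\ngeneB\t7\t10\n"
matrix_text = ">taxon1\nATGCCAGGTA\n>taxon2\nATGCCA----\n"

boundaries = read_indices(io.StringIO(indices_text))
matrix = read_fasta(io.StringIO(matrix_text))
gene_b = slice_gene(matrix, boundaries, "geneB")

for taxon, seq in matrix.items():
    print(taxon, gene_b[taxon], percent_missing(seq))
```

With real output, the `io.StringIO` objects would be replaced by open file handles on indices.tsv and the FASTA matrix.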