Homolog Collection
-
Collect putative homologs from input taxa using
fisher.py
fisher.py [OPTIONS]
Optional arguments:
-t
,--threads <N>
Number of threads, where N is an integer- Default : 1
-
-n
,--max_hits <N>
Max number of hits to collect, where N is an integer- Default: 5
-
--keep_tmp
Keep temporary files -
--all_BBH
Keep all significant BLAST hits regardless of phylogenetic affiliation as determined with FastTree through the phylogenetically aware route -
--add <input_metadata.tsv>
Path to input metadata file(different from original the one inconfig.ini
) that contains new input proteomes. Must be used with–add_to
option below. -
--add_to <fisher_out>
Previous fisher.py output directory to add to. Must be used with the--add
option above. -
-o
,--output <out_dir>
Path to user-defined output directory- Default:
./fisher_out_<M.D.Y>
- Default:
-h
,--help
Show this message and exit.
Default fisher.py output:
-
a directory called
fisher_out_<M.D.Y>
that contains:-
fasta files of orthologs from the database with putative homologs from newly input taxa appended to the end of the fasta file. NOTE: if no putative homologs were found in any newly input proteome for a particular ortholog in the database, that ortholog will not be present in this directory. All files will have the extension “.fas”.
-
a tab delimited file called
orgininal_names.tsv
with the original sequence headers from the input proteomes that were clustered by CD-HIT in fisher.py in the first column and a new “clean” name assigned byfisher.py
made up of the chosen Unique ID and a numbered 1 to N.
-
-
a file
noncorresponding_hits.txt
that contains:- information regarding collected sequences whose best BLAST hit was not the corresponding ortholog in the database. (See fisher algorithm details for more information on this criterion). This file will NOT appear if the best BLAST hit of each retained sequence is the corresponding ortholog from the database.
NOTE: Users that plan to collect homologs from more taxa prior to single gene tree construction should stop here for now.
-
Running more rounds of addition prior to single gene tree construction (OPTIONAL).
fisher.py [OPTIONS] --add <Add_More_Taxa.tsv> --add_to <previous_fisher_out_dir>
Use
-–add
option in combination with the--add_to
option offisher.py
to include homologs from more taxa in another round of addition to the working dataset. Create a new file identical toinput_metadata.tsv
in structure except with a different name (Ex.Add_More_Taxa.tsv
). Put this new file in the project directory created in STEP 1. The information in this new text file will be appended to input_metadata.tsv.NOTE: We recommend checking that the information contained in the new input file was added to the end of
input_metadata.tsv
after the new addition run is complete. If so, delete the new input file after addition is complete to avoid unnecessary clutter in the directory
-
Produce preliminary statistics about newly input data.
informant.py [OPTIONS] -i <input_directory>
Required arguments:
-i
,--input <input_dir>
Path tofisher.py
output directory
Optional arguments:
-
--orthologs_only
Paralogs will NOT be included from any taxa in the database in downstream single gene tree construction. -
-h
,--help
Show this help message and exit
At this step the statistics provided by informant.py should be considered preliminary. The number of orthologs shown for a newly input taxon can be inflated due to contamination in the data or if
fisher.py
harvested only a paralog for a particular gene (If only one sequence passes all criteria it is automatically marked the putative ortholog but is not always the true ortholog). These statistics will almost always change after manual curation. However, they can be generated quickly and are useful for a rough estimate of data quality.Default
informant.py
output (found in thefisher.py
output directory provided toinformant.py
via the-i
,--input
options):-
directory called
informant_stats
that contains five files:-
db_taxa_stats.tsv
- a comma separated file with taxa as rows and seven columns:- Unique ID - short name of organism in the database
- Long Name - full name of organism in the database
- Higher Taxonomy - highest taxonomic level assigned to the taxon in
metadata.tsv
- Lower Taxonomy - lower taxonomic level assigned to the taxon in
metadata.tsv
- Genes collected out of “N”- number of genes present in taxa already in the database where N is the number of genes found by
fisher.py
for all newly added taxa - SGT - will allow for the inclusion (YES) or exclusion (NO) of taxa in single gene tree construction on a per taxon basis
- Paralogs - will allow for the inclusion (YES) or exclusion (NO) of paralogs on a per taxon basis for single gene tree construction. If no paralogs exist for an organism in the database this column will be labelled “none”. If the
--orthologs_only
flag was used, columns for all taxa that have paralogs in the database will be marked “NO.”
genes_stats.tsv
- a comma separated file with gene names from the database as rows and four columns. Only genes that had a sequence from at least one newly added taxon appended to the alignment are included.- Gene Name - gene name in the database
- Number of Taxa - number of taxa which the gene is present
- Percent of Total Taxa (out of “N”) - percentage of taxa which have the gene where N is the total number of taxa.
- SGT - include gene in gene tree construction step. To be used by
working_dataset_constructor.py
downstream
-
new_taxa_stats.tsv
- a comma separated file with taxa as rows and ten columns:- Unique ID - short name of organism in the database
- Long Name - full name of organism in the database
- Higher Taxonomy - highest taxonomic level assigned to the taxon in metadata.tsv
- Lower Taxonomy - lower taxonomic level assigned to the taxon in metadata.tsv
- Genes collected out of “N”- number of genes present in newly added taxa where N is the number of genes found by
fisher.py
for all newly added taxa - Sequences collected - number of total sequences collected by
fisher.py
for a taxon. - #SBH - number of collected sequences for a taxon where the specific query used produced a significant hit from the input proteome and the collected sequence WAS sister to a sequence already present in the database of the same Higher Taxonomy during FastTree analysis performed automatically by
fisher.py
- #BBH - number of collected sequences for a taxon where the specific query used produced a significant hit and the collected sequence WAS NOT sister to a sequence already present in the database of the same higher taxonomy during FastTree analysis performed automatically by
fisher.py
- #HMM - number of collected sequences for a taxon that were selected by the default HMM route.
- SGT - will allow for the inclusion (YES) or exclusion (NO) of taxa in gene tree construction on a per taxon basis
–
occupancy.tsv
- a comma separated file with taxa as rows and genes in the starting dataset as columns. A “0” (zero) indicates a gene is absent in a taxon. Any non-zero number is the number of sequences collected for that “gene” for a taxon.–
occupancy_with_routes.tsv
- a comma separated file with taxa as rows and genes in the non-zero number is the number of sequences collected for that “gene” for a taxon. The abbreviation for the route throughfisher.py
that each gene was collected (either SBH,BBH, HMM) by is appended to
-
-
Collect taxa, and homologs for gene tree construction
working_dataset_constructor.py [OPTIONS] -i <input_directory>
Required arguments:
-i
,--input <input_dir>
Path to output directory offisher.py
that contains the output ofinformant.py
Optional arguments:
-
-o
,--output <out_dir>
Path to user-defined output directory- Default:
./working_dataset_constructor_out_<M.D.Y>
- Default:
-
-h
,--help
Show this help message and exit
NOTE: To exclude genes, newly added taxa, or taxa already in the database, from the working dataset used for gene tree construction (NOT RECOMMENDED) change the column values for the column “SGT” in
gene_stats.tsv
,new_taxa_stats.tsv
,db_taxa_stats.tsv
respectively to “NO.” To exclude previously identified paralogs for taxa change the value for “Paralogs” indb_taxa_stats.tsv
to “NO.”Default
working_dataset_constructor.py
output:-
A directory
working_dataset_constructor_output_<M.D.Y>
that contains:- fasta files of genes to be included in single gene tree construction. This will be either all or the chosen subset of the gene files contained in the input directory. Each gene file contains only taxa to be included in single gene tree construction.
-