taxmodule was developed with Titus Brown, Taylor Reiter, Luiz Irber, and Hannah Houts. The approaches described have emerged over years of
sourmashdevelopment and exploration by these authors as well as Phil Brooks, Bry Moore, Mo Abuelanin, and a number of other key
sourmashcontributors. We’re preparing preprints to properly describe this functionality, but for now please engage with us on gitter, github, twitter, etc!
v4.2, we introduced a new module,
sourmash tax (alias:
taxonomy) for integrating taxonomic information into
sourmash gather results. With this addition, the
sourmash tax approach supercedes the
sourmash lca approach as our recommended method for taxonomic integration.
sourmash gather finds the minimal set of reference genomes that match your query
The approach described here relies on
Scaled Minhash sketches, which are comprised of a chosen proportion of k-mers from an input sequence dataset, rather than a static chosen number of k-mers, as in standard MinHash. While this means
Scaled MinHash sketch size scales with dataset size, downsampling sequencing datasets in this way enables estimation of
containment, which facilitates comparisons between datasets of very different sizes (e.g. trying to figure out which genomes are in your metagenome). You can find a bit more on
Scaled MinHash in our sourmash use cases paper (2019).
sourmash gather is an approach that uses
containment to find the smallest set of reference genomes that contain all k-mers in your query sketch. Titus gave an excellent talk on
sourmash gather recently (slides), and the approach is described in detail in Luiz Irber’s dissertation. Luiz’s benchmarking suggests that gather is a very accurate method for doing strain-level resolution of genomes, but I’ll leave a full description for our preprint (hopefully out by year-end :).
As the exact strains from novel sequencing projects are often not present in reference databases,
gather results commonly include matches to multiple strains of the same species, if there are multiple strains present in the reference database. These matches represent the portions of our novel query strain(s) that share genomic content with strains in the reference database.
sourmash tax annotates gather results and conducts LCA summarization
To connect to all the research that has been done to describe organisms at the species level (and above), we wanted to be able to integrate taxonomic lineage information to the
gather-reported genome matches. Enter
sourmash tax, which adds taxonomic integration and optionally uses a modified lowest common ancestor (LCA) approach to summarize results to higher taxonomic ranks.
How is this different from current k-mer LCA methods?
The main innovation of the
tax method is that LCA aggregation occurs at the genome level, or really at the “groups of k-mers” level, rather than at the individual k-mer level.
Single k-mers, either for true biological reasons or as a result of reference inaccuracies, can sometimes match to quite different organisms and derail the individual k-mer LCA approach (nicely laid out in this twitter thread):
Consider the case where you have 20 hits, 19 of which are to the Genus Prevotella, and 1 of which is to the Archaea Methanobrevibacter.— Mick W@tson (@BioMickWatson) July 30, 2021
Most people would say "Prevotella!" but the lowest common ancestor would be "cellular organisms" - this is not useful
In our experience, aggregating k-mer matches to the set of best genome matches with
gather prior to LCA means contamination and other reference issues are far less likely to impact LCA annotations. As an added benefit, this approach allows us to do taxonomic assignment with multiple taxonomies (e.g. GTDB if available, NCBI for the remaining), so long as the
gather was run against the appropriate databases.
How does it work?
This approach relies upon the fact that
gather reports both the total fraction of the query matched to each reference genome, as well as a non-overlapping
f_unique_to_query, which is the fraction of the query uniquely matched to each reference genome. The
f_unique_to_query for any reference match will always be between 0 (0% of query matched) and 1 (100% of query matched). For query k-mers that match more than one genome,
gather iteratively assigns these k-mers to their maximal (best containment) genome match in order to report the non-overlapping
For a query matched to multiple references, the
f_unique_to_query of all of its reference matches will sum to at most 1 (100% of query matched). We use this property to do LCA aggregation of
gather results. For example, if the
gather results for a metagenome include matches to 30 different strains of a given species, we can sum the fraction uniquely matched to each strain to obtain the total fraction uniquely matched to this species.
Note that gather results themselves are still affected by the completeness of the reference database, as are all reference-based analyses. Long DNA k-mers are quite specific, so if you get a match, it’s pretty reliable. However, if you don’t get a match, it may mean that the species or genus you’re querying may not be closely related enough to anything in the reference database. We’ve been exploring protein k-mers to overcome some of these challenges and enable matching across larger evolutionary distances. I’ll go into detail in my next blog post!
sourmash tax commands
Two commands in
sourmash tax use our LCA approach:
genome. We hope the intended input query for each of these methods is self-explanatory :).
For all methods, you’ll need to have sketched your query or queries, and run
sourmash gather against a reference database. You can find pre-prepared
GTDB reference databases here. Second, you will need a file containing the taxonomic information for all genomes in your reference database (
gtdb-rs202 file here).
sourmash tax metagenome summarizes gather results for each query metagenome by taxonomic lineage, and (by default) reports the results at each taxonomic rank.
sourmash tax genome reads gather results for each query genome and attempts to classify to the taxonomic lineage of the best gather result. Our current approach uses a threshold of 10% query coverage before assigning a classification, meaning that if an insufficient portion of our query is matched at the species level, we will use the LCA lineage summarization approach above to assess the match at each rank, and classify to the rank where our classification threshold is met. Alternatively, classification can be conducted at a specific taxonomic rank, regardless of classification threshold. This method is our newest addition, and we welcome feedback and results to help improve it going forward.
Note we also offer
krona output from
metagenome or rank-classified
genome results, as well as a
lineage_summary output for
tax metagenome that may be useful for external multi-sample visualization tools.
There are two additional commands within the
sourmash tax submodule,
tax prepare is a method for converting a
csv of taxonomic lineage information into an
sqlite database to enable faster loading and lineage assignment. It can also be used to combine lineage information for more that one database (e.g. GTDB, NCBI).
tax annotate annotates
gather results with taxonomic information, without doing any LCA summarization.
What about the
sourmash lca methods?
sourmash lca submodule uses the LCA approach pioneered by Kraken and used by many other tools. It uses taxonomic information to classify individual k-mers, and then infers taxonomic distributions of metagenome contents from the presence of these individual k-mers. The
lca subcommands work with LCA databases, which contain taxonomic information by design, and are pretty neat for other reasons too.
While we recommend using the
tax method, the
lca module isn’t going anywhere, and many may still find it interesting or useful.
Please see the documentation for additional details, and contact us on gitter with questions and comments, or file an issue on our github. We’d love to hear if and how these methods are useful to you!