Mini-review of 'Compressive Pangenomics Using Mutation-Annotated Networks' (PanMAN)

This is a mini-review (just highlighting some initial thoughts) of this preprint:

Compressive Pangenomics Using Mutation-Annotated Networks
Sumit Walia, Harsh Motwani, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia
https://www.biorxiv.org/content/10.1101/2024.07.02.601807v2

This is an extension of the idea of mutation-annotated trees, which were successfully applied to SARS-CoV-2 in the UShER (and related) software. A phylogeny is stored as the mutations on each branch, starting from the root, rather than keeping every sequence in full. This is not unlike the use of ancestral recombination graphs to represent genotypes and evolutionary history in e.g. human genomes.
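To make the 'mutations on each branch' idea concrete, here is a minimal sketch of how sequences can be recovered from such a tree. This is my own toy illustration, not the UShER or PanMAN data structure: the `Node` class, the `reconstruct` function and the example tree are all hypothetical, and only substitutions are handled (no indels).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; not the UShER/PanMAN format."""
    name: str
    # Substitutions on the branch leading to this node: (0-based position, new base)
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def reconstruct(root_seq, root):
    """Return {leaf name: sequence} by replaying branch mutations from the root."""
    leaves = {}

    def walk(node, seq):
        seq = list(seq)  # copy, so sibling branches do not see each other's changes
        for pos, base in node.mutations:
            seq[pos] = base
        if not node.children:
            leaves[node.name] = "".join(seq)
        for child in node.children:
            walk(child, seq)

    walk(root, root_seq)
    return leaves


# Toy example: only the root sequence and five branch annotations are stored
tree = Node("root", children=[
    Node("A", mutations=[(1, "T")]),
    Node("internal", mutations=[(3, "G")], children=[
        Node("B"),
        Node("C", mutations=[(0, "C")]),
    ]),
])
sequences = reconstruct("ACGT", tree)
```

The storage saving comes from closely related samples sharing most of their path from the root, so each shared mutation is recorded once rather than once per sample.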

My overall impression was that this is an exciting and novel idea to apply to the pangenomes field. The analysis and benchmarking in the preprint are somewhat limited, and I have some reservations about how applicable this technique is to bacterial populations (my main interest). I’m looking forward to seeing where the authors take this next.

Bits I liked

  • The idea of splitting into mutations of homology blocks, then applying SNPs within those is neat, and seems to work well.

  • Another example of ‘annotation-free’ pangenomics – where DNA is simply aligned as a first step – being powerful. PanGraph proves useful again! It’s harder to interpret genes in this framework, but this paper shows you can do other useful tasks.

  • The compression looks very impressive.

  • The native ability to extract phylogeny, whole genome alignments and VCF/GFAs of variation is very useful.

  • The view of recombinations in SARS-CoV-2 (Fig. 5d) looks nice, although this would presumably become intractable in more frequently recombining populations, e.g. bacteria.

  • A detailed technical explanation of many of the code design choices is included.

  • The documentation is extensive, and the software looks easy to install.
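As a rough illustration of the first point above (block-level mutations first, then SNPs within blocks), here is a toy sketch of applying one branch's annotations to a parent genome. Everything here is an assumption of mine for illustration: the names `apply_branch` and `genome`, the `"ins"`/`"del"` operation labels, and the simplification that blocks have a fixed global order. The preprint's actual representation is more involved.

```python
def apply_branch(parent, block_muts, snps, consensus):
    """Apply one branch's mutations to a parent state (hypothetical layout).

    parent:     dict of block id -> sequence for blocks present in the parent
    block_muts: list of ("ins" | "del", block id), applied first
    snps:       list of (block id, 0-based position, new base), applied second
    consensus:  dict of block id -> sequence used when a block is gained
    """
    child = dict(parent)  # leave the parent state untouched
    for op, block_id in block_muts:
        if op == "ins":
            child[block_id] = consensus[block_id]
        elif op == "del":
            child.pop(block_id, None)
    for block_id, pos, base in snps:
        seq = child[block_id]
        child[block_id] = seq[:pos] + base + seq[pos + 1:]
    return child


def genome(state, block_order):
    """Concatenate the blocks that are present, in an assumed fixed global order."""
    return "".join(state[b] for b in block_order if b in state)


consensus = {"b1": "ACGT", "b2": "GGA", "b3": "TTC"}
block_order = ["b1", "b2", "b3"]
root = {"b1": "ACGT", "b2": "GGA"}

# One branch: lose block b2, gain block b3, then one SNP inside b3
child = apply_branch(root, [("del", "b2"), ("ins", "b3")], [("b3", 1, "A")], consensus)
```

A whole-block gain or loss costs one annotation regardless of the block's length, which is where I would expect much of the benefit over a purely SNP-level tree to come from.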

Questions/concerns

  • SARS-CoV-2 doesn’t really have a pangenome, at least in the sense that it doesn’t have major HGT or gene gain and loss. Indels could count towards one. As far as I know there isn’t major reordering of the genome? Similar for Mtb.

  • Compression of the dataset size is a metric, but not the first one I’d go for. We probably care more about what you can do with the data, and the efficiency of construction. But compression was the only result reported. I would have loved to see benchmarking of the panmanUtils outputs against ‘standard’ phylogenies and pangenome aligners.

  • If looking at compression, what about MBGC or MOF? Also, an FM-index is about 350x smaller and allows various operations on the data.

  • The comparison to other pangenomic methods could be expanded. PanGraph is good to include, but VG etc. aren’t really designed for microbial genomes. What about Panaroo and ggCaller? They at least look for mutations, so are more comparable to PanMAN.

  • Given the amount of HGT that has occurred in e.g. E. coli I am skeptical that a root or ancestral sequence would be meaningful. Again, some validation/analysis here would be interesting.

  • How does incorrect input tree topology affect the resulting PanMAN?