Assessing assembly quality in metagenomes of increasing complexity sequenced with HiFi long reads
hal.structure.identifier | Scalable, Optimized and Parallel Algorithms for Genomics [GenScale] | |
hal.structure.identifier | Pleiade, from patterns to models in computational biodiversity and biotechnology [PLEIADE] | |
dc.contributor.author | MAURICE, Nicolas | |
hal.structure.identifier | Pleiade, from patterns to models in computational biodiversity and biotechnology [PLEIADE] | |
dc.contributor.author | FRIOUX, Clémence | |
hal.structure.identifier | Scalable, Optimized and Parallel Algorithms for Genomics [GenScale] | |
dc.contributor.author | LEMAITRE, Claire | |
hal.structure.identifier | Scalable, Optimized and Parallel Algorithms for Genomics [GenScale] | |
hal.structure.identifier | Institut des sciences informatiques et de leurs interactions - CNRS Sciences informatiques [INS2I-CNRS] | |
dc.contributor.author | VICEDOMINI, Riccardo | |
dc.date.issued | 2024-11 | |
dc.date.conference | 2024-11-13 | |
dc.description.abstractEn | Evaluating the quality of metagenome assemblies can be challenging, especially when no reference genome is available and when comparing samples of varying taxonomic complexity and sequencing depth. A high-quality assembly is expected not only to produce high-quality bins, but also to be representative of most of the read sequences, especially in complex samples where algorithms struggle to reconstruct low-abundance genomes. Recent studies showed a great improvement in the number and quality of bins obtained with highly accurate PacBio HiFi long reads. However, it remains to be assessed how much of the sample these bins represent, especially in highly complex environmental samples. There is therefore a need to use and compare other evaluation methods. We designed and implemented Mapler, a metagenomic assembly and evaluation pipeline with a primary focus on evaluating the quality of HiFi-based metagenome assemblies. It incorporates state-of-the-art tools for assembly, binning, and assembly evaluation. In addition to classifying assembly bins into classical quality categories according to their marker gene content and taxonomic assignment, Mapler analyzes the alignment of reads to contigs. To do so, it calculates the ratio of mapped reads and bases, and separately analyzes mapped and unmapped reads via their k-mer frequency, read quality, and taxonomic assignment. We compared three metagenomic datasets of increasing complexity, all sequenced using PacBio HiFi technology: a mock community of 21 populations, a human gut microbiome, and a soil microbiome. At equal sequencing depth, for low and medium complexity samples (mock community and human gut, respectively), more than 90% of read sequences are found in bins regardless of the assembler used. In the soil sample, however, not only does the proportion of high-quality bins drop drastically, but 54% to 88% of sequenced bases also fail to map to the assembly.
We show that considering multiple metrics is important for accurately assessing assembly quality, as relying on a single metric can be misleading. In particular, taking into account both bin quality and read alignment is indispensable, especially in complex environments. Mapler is a pipeline that calculates these metrics with minimal user input, producing both textual and graphical outputs, making it well-suited for evaluating assembly methodologies across a wide range of sample complexities. Mapler is open source and publicly available at https://gitlab.inria.fr/mistic/mapler. | |
dc.description.sponsorship | Computational models of crop plant microbial biodiversity - ANR-22-PEAE-0011 | |
dc.language.iso | en | |
dc.rights.uri | http://creativecommons.org/licenses/by/ | |
dc.subject.en | Microbiome | |
dc.subject.en | Sequencing | |
dc.subject.en | Metagenome assembly | |
dc.subject.en | Complex ecosystems | |
dc.title.en | Assessing assembly quality in metagenomes of increasing complexity sequenced with HiFi long reads | |
dc.type | Conference paper | |
dc.subject.hal | Computer Science [cs]/Bioinformatics [q-bio.QM] | |
bordeaux.page | 1-21 | |
bordeaux.conference.title | 2024 - 24th Genome Informatics meeting | |
bordeaux.country | GB | |
bordeaux.conference.city | Hinxton | |
bordeaux.peerReviewed | yes | |
hal.identifier | hal-04822870 | |
hal.version | 1 | |
hal.invited | no | |
hal.proceedings | no | |
hal.conference.end | 2024-11-15 | |
hal.popular | no | |
hal.audience | International | |
hal.origin.link | https://hal.archives-ouvertes.fr//hal-04822870v1 | |
File(s) constituting this document
Files | Size | Format | View |
---|---|---|---|
There are no files associated with this document. |
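
The abstract states that Mapler calculates the ratio of mapped reads and bases. As a rough illustration of what such a metric could look like, the following minimal Python sketch computes both ratios from per-read alignment summaries. It is not Mapler's implementation: the `ReadAlignment` record, the `mapping_ratios` function, the definition of a mapped read as one with at least one aligned base, and the toy values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ReadAlignment:
    """Hypothetical per-read summary; not a Mapler data structure."""
    name: str
    length: int         # read length in bases
    aligned_bases: int  # bases of the read covered by an alignment (0 if unmapped)


def mapping_ratios(reads: Iterable[ReadAlignment]) -> Tuple[float, float]:
    """Return (fraction of reads mapped, fraction of bases mapped).

    Assumption for this sketch: a read counts as mapped if at least one
    of its bases is aligned to a contig.
    """
    n_reads = n_mapped = total_bases = mapped_bases = 0
    for read in reads:
        n_reads += 1
        total_bases += read.length
        if read.aligned_bases > 0:
            n_mapped += 1
            # Guard against summaries reporting more aligned bases than the read has.
            mapped_bases += min(read.aligned_bases, read.length)
    if n_reads == 0 or total_bases == 0:
        return 0.0, 0.0
    return n_mapped / n_reads, mapped_bases / total_bases


# Toy data (invented for illustration, not taken from the study).
toy_reads = [
    ReadAlignment("read1", 10_000, 10_000),  # fully aligned
    ReadAlignment("read2", 8_000, 4_000),    # partially aligned
    ReadAlignment("read3", 12_000, 0),       # unmapped
    ReadAlignment("read4", 10_000, 0),       # unmapped
]
read_ratio, base_ratio = mapping_ratios(toy_reads)
print(f"mapped reads: {read_ratio:.2%}, mapped bases: {base_ratio:.2%}")
```

Reporting the two ratios separately matters when read lengths vary: a sample can have many short mapped reads yet leave most sequenced bases unaligned, which is the kind of discrepancy the abstract highlights for the soil sample.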