Correcting modification-mediated errors in nanopore sequencing by nucleotide demodification and reference-based … – Nature.com

Posted: November 30, 2023 at 8:35 pm

Unusual low-quality ONT genomes due to extensive modifications

We sequenced 12 microbial strains of Listeria monocytogenes using Illumina and ONT R9.4 flowcells (~200-990 Mbp, SUP model) (Fig. 1a, Supplementary Tables 1 and 2). The ONT reads were assembled into genomes, and the remaining sequencing errors were polished by Medaka and Homopolish (Supplementary Table 3, see Methods). The Illumina and ONT reads were also hybrid-assembled for evaluation purposes (Supplementary Table 4). When compared with the Illumina/ONT hybrid assemblies (Fig. 1b), seven ONT-only genomes exhibited high quality (HQ), ranging from Q47 to Q60 (e.g., R19-2905 and R20-0088). However, five isolates (R20-0026, R20-0030, R20-0127, R20-0148, and R20-0150) showed unexpectedly low quality (LQ), varying from Q26 to Q32. The accuracy of these five LQ genomes did not improve after repeated ONT sequencing. Further investigation of the five LQ genomes revealed excessive numbers of mismatch errors (1533-5670) compared with the seven HQ ones (0-40 mismatches) (Fig. 1c). Homopolymer errors (i.e., indels) were not the source of the inferior quality (7-306, Supplementary Table 5).
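To make the Q scores above concrete, the following is a minimal sketch of the standard Phred-scaled relation between error count and consensus quality, Q = -10 log10(errors / genome length). The genome length of 2.9 Mbp and the cap at Q60 are assumptions made for illustration and are not taken from the study's Methods; the sketch also counts mismatches only, whereas the reported Q scores include indels as well.

```python
import math


def q_score(errors: int, genome_length: int, cap: float = 60.0) -> float:
    """Phred-scaled consensus quality: Q = -10 * log10(errors / genome_length)."""
    if errors == 0:
        # Zero errors would give an infinite Q, so report the (assumed) cap instead.
        return cap
    return min(cap, -10.0 * math.log10(errors / genome_length))


# Assumed genome length, roughly the size of an L. monocytogenes chromosome.
GENOME_LENGTH = 2_900_000

# Mismatch counts quoted in the text for R20-0026 (before and after WGA).
q_before = q_score(5670, GENOME_LENGTH)
q_after = q_score(16, GENOME_LENGTH)
print(f"5670 mismatches: about Q{q_before:.1f}")
print(f"16 mismatches: about Q{q_after:.1f}")
```

Under these assumptions, a few thousand mismatches in a genome of this size place the assembly in the high Q20s, while a few dozen or fewer place it above Q50, which is in line with the LQ and HQ ranges reported here.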

Fig. 1. a Workflow of ONT-only and ONT/Illumina hybrid assembly; b Q scores; c number of mismatches (red: LQ, gray: HQ); d comparison of ONT and Illumina reads by IGV; e numbers of 5mC, 6mA, and mismatches between HQ/LQ strains (n=12, red: LQ, gray: HQ). Error bars represent the minimum and maximum values.

Manual inspection revealed that these mismatches were ONT basecalling errors that remained uncorrected after genome polishing (Fig. 1d and Supplementary Fig. 1). As mismatch errors in ONT are mainly due to epigenetic modifications, we computed the frequency of well-known methylation types in these isolates (see Methods and Supplementary Table 6). For 5-methylcytosine (5mC), the numbers of modified loci in the five LQ genomes (~240-340k) were not significantly higher than those in the HQ ones (210-345k, P=0.89, Fig. 1e). Similarly, the numbers of N6-methyladenine (6mA) modifications showed no significant difference between the LQ and HQ groups (98-218k vs. 126-223k, P=0.34). Because the numbers of mismatch errors in the LQ genomes were nevertheless significantly higher than those in the HQ ones (P=0.005), we suspected that the ONT basecalling algorithms failed to handle novel modification types present in the LQ isolates.

We removed the modifications in all microbial samples by whole-genome amplification (WGA) (Fig. 2a), which randomly amplifies genome fragments without retaining any epigenetic modification (see Methods). The WGA-demodified samples were sequenced by ONT (R9.4), assembled into chromosomes, and compared with the Illumina/ONT hybrid genomes (Fig. 2a, Supplementary Tables 7 and 8). After WGA, the five LQ genomes exhibited significantly higher quality than those without demodification (e.g., Q26 to Q53 in R20-0026) (Fig. 2b, Supplementary Table 9). In particular, the number of mismatch errors dropped sharply after demodification (e.g., 5670 to 16 in R20-0026) (Fig. 2c). We therefore concluded that the unexpectedly low quality of the ONT genomes was due to excessive modification-induced errors that the basecalling model had not been trained on. Demodification by WGA can produce high-quality ONT genomes without the need for Illumina short reads.

Fig. 2. a Workflow of WGA-demodified ONT; b Q scores of the WGA-demodified and ONT-only genomes (gray: ONT, black: WGA ONT); c numbers of mismatches of the WGA-demodified and ONT-only genomes (gray: ONT, black: WGA ONT); d WGA and ONT-only genome quality with respect to sequencing depth (shading: minimum and maximum quality in five replicates, line: median quality); e numbers of active/available pores during WGA-demodified and ordinary ONT sequencing.

However, while WGA successfully erased these modifications, it increased the sequencing cost for two reasons. First, WGA required a higher sequencing depth (~100×) to assemble a complete genome than ordinary ONT sequencing (~30×) (Fig. 2d and Supplementary Figs. 2 and 3). This was due to the uneven amplification of WGA, which led to non-uniform sequencing depth and a fragmented assembly at moderate coverage. Second, WGA-demodified samples may reduce the ONT yield. We observed that the number of available/active pores could sometimes decrease quickly (e.g., fewer than 100 pores after 12 h) (Fig. 2e), possibly owing to hyperbranched structures left unresolved after WGA (ref. 10). Consequently, the cost of sequencing WGA-demodified samples with ONT is much higher than that of ordinary sequencing.
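The effect of the higher depth requirement on sequencing yield can be illustrated with simple arithmetic: the total number of bases needed is roughly the genome size multiplied by the target depth. The sketch below assumes a 2.9 Mbp genome (an illustrative figure, not one from the study) and ignores the additional loss of active pores described above.

```python
def required_yield_mbp(genome_size_mbp: float, target_depth: float) -> float:
    """Total sequenced bases (in Mbp) needed to reach a mean depth over the genome."""
    return genome_size_mbp * target_depth


GENOME_SIZE_MBP = 2.9  # assumed genome size, for illustration only

ordinary = required_yield_mbp(GENOME_SIZE_MBP, 30)   # ~30x for ordinary ONT
wga = required_yield_mbp(GENOME_SIZE_MBP, 100)       # ~100x for WGA-demodified ONT

print(f"ordinary ONT: {ordinary:.0f} Mbp")
print(f"WGA ONT:      {wga:.0f} Mbp")
print(f"ratio:        {wga / ordinary:.2f}")
```

The ratio is independent of the assumed genome size: a sample that needs ~100× instead of ~30× needs a little over three times the yield per genome.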

We developed a novel computational method (called Modpolish) for correcting these modification-mediated errors without WGA and without prior knowledge of the modification systems. Modpolish identifies and corrects modification-mediated errors by leveraging basecalling quality, basecalling consistency, and evolutionary conservation (Fig. 3a, see Methods). Briefly, because modifications disturb the ONT signals, the basecalling quality at modified loci is substantially lower than at modification-free loci (Supplementary Fig. 4). As a result, the basecalled nucleotides are often inconsistent at the modified loci (Supplementary Fig. 5), yet these loci lie within conserved motifs (Supplementary Fig. 6). Modpolish combines these signals with the degree of conservation measured across closely related genomes and corrects only the modified loci with ultra-high conservation, which gives it high specificity and avoids false corrections of genuine strain variation.
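The three signals described above can be combined into a simple per-site decision rule. The following is a minimal sketch of that idea only; it is not the Modpolish implementation, and the `Site` structure, the function name, and all three thresholds are hypothetical values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    draft_base: str      # base at this position in the polished draft assembly
    read_bases: str      # bases called by the reads covering this position
    mean_quality: float  # mean basecalling quality of those reads at this position
    homolog_bases: str   # bases at the aligned position in closely related genomes


# Hypothetical thresholds (assumptions, not values from the paper).
MAX_QUALITY = 15.0       # "low basecalling quality"
MAX_CONSISTENCY = 0.8    # "inconsistent basecalls": top base supported by <= 80% of reads
MIN_CONSERVATION = 0.99  # "ultra-high conservation" among related genomes


def propose_correction(site: Site) -> Optional[str]:
    """Return the conserved base if all three signals agree, otherwise None."""
    if not site.read_bases or not site.homolog_bases:
        return None
    # Fraction of reads supporting the most common basecall.
    consistency = Counter(site.read_bases).most_common(1)[0][1] / len(site.read_bases)
    # Most common base among related genomes and the fraction that carry it.
    conserved_base, count = Counter(site.homolog_bases).most_common(1)[0]
    conservation = count / len(site.homolog_bases)

    suspicious = site.mean_quality <= MAX_QUALITY and consistency <= MAX_CONSISTENCY
    if suspicious and conservation >= MIN_CONSERVATION and conserved_base != site.draft_base:
        return conserved_base
    return None


# A site that looks modification-disturbed: low quality, mixed calls, conserved 'G'.
modified = Site("A", "AAAAAGGGGT", 9.0, "G" * 50)
# A likely true strain variant: reads agree with the draft at high quality.
variant = Site("A", "AAAAAAAAAA", 30.0, "G" * 50)

print(propose_correction(modified))
print(propose_correction(variant))
```

Requiring all three conditions at once is what protects genuine strain variants: a confidently and consistently called base is left alone even when it disagrees with every related genome.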

Fig. 3. a Workflow of Modpolish; b Q scores before and after Modpolish; c numbers of mismatches before and after Modpolish (gray: before Modpolish, black: after Modpolish); d the antiviral defending systems encoded by the 12 strains (gray: before Modpolish, black: after Modpolish); e the sequence motif of modification sites in the four mza-encoding strains; f the sequence motif of modification sites on the R20-0026 strain.

We assessed the accuracy of Modpolish by comparing the quality of the ONT-only genomes (polished by Medaka) with those further polished by Modpolish. The results indicated that Modpolish significantly improved the quality of all LQ genomes, from Q27-34 to Q60 (Fig. 3b, Supplementary Table 10). The number of mismatches also decreased greatly (e.g., from 5670 to 67 in R20-0026) (Fig. 3c). The numbers of mismatches in some HQ genomes were also reduced by Modpolish. For instance, the mismatches in R19-2905 were reduced from 40 to 6. Our results therefore suggested that Modpolish made no false corrections on the HQ genomes (Supplementary Tables 11-13). A comparison of different basecaller versions and models (v4.0.14 vs. v6.3.4, HAC vs. SUP) indicated that these errors persist across versions and that Modpolish successfully removes most of them (Supplementary Fig. 7).

As modification systems are often involved in anti-phage defense (e.g., R-M, BREX, DISARM) (refs. 11-13), we investigated the defense systems possessed by the HQ and LQ strains (Fig. 3d, Supplementary Data 1). All the HQ genomes encode at least one R-M system (e.g., Type I, II, or III), which is missing in all LQ isolates. Instead, four LQ strains (i.e., R20-0030, R20-0127, R20-0148, R20-0150) carry a novel methyltransferase-encoding mza defense system that is absent from all HQ genomes (Supplementary Fig. 8). Analysis of the modification sites in the four mza-encoding LQ strains revealed the pentanucleotide motif GCAGC (Fig. 3e, Supplementary Fig. 6). In contrast, the modification loci in the LQ strain R20-0026 all centered on the motif GCTGG (Fig. 3f). Together, these results suggested that two lineage-specific modification systems extensively edited the five LQ genomes. Although their underlying mechanisms remain unclear, the editing of specific, highly conserved motifs within each lineage allowed cost-effective in silico correction of these errors by Modpolish.
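Because the modification sites cluster on short, fixed motifs, candidate loci can be located in an assembly by scanning both strands for the motif. The sketch below is a generic illustration of such a scan, not the analysis used in the study; the function names and the toy sequence are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def motif_hits(genome: str, motif: str) -> list:
    """Return sorted (position, strand) pairs for every motif occurrence.

    Positions are 0-based start coordinates on the forward strand; a '-' hit
    means the motif itself lies on the reverse strand at that location.
    """
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # avoid reporting palindromic motifs twice
        patterns.append(("-", rc))

    hits = []
    for strand, pattern in patterns:
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(pattern, start + 1)  # step by 1 to allow overlaps
    return sorted(hits)


# Toy sequence (hypothetical) with one GCAGC on each strand.
toy = "TTGCAGCAAGCTGCTT"
print(motif_hits(toy, "GCAGC"))
```

Scanning the reverse strand matters because GCAGC and GCTGG are not palindromic, so roughly half of the motif occurrences in a genome appear on the forward strand only as their reverse complements (GCTGC and CCAGC).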

We then assessed the performance of Modpolish on public ONT datasets sequenced with R9.4 (SUP) and R10.4 flowcells (SUP, duplex/simplex modes). In the R9.4 dataset (ref. 14), we first compared the quality of seven bacterial genomes polished by Medaka and by Modpolish (Fig. 4a, Supplementary Table 14). The quality of five genomes improved significantly, from ~Q45 to Q60. As before, the improvement was mainly due to the reduction of mismatches (Fig. 4b). For instance, the number of mismatches in the Staphylococcus genome decreased from 388 to 13 after Modpolish. Across all genomes, the mismatch reduction rates ranged from 50% to 96%. Consequently, although these bacterial genomes are not extensively modified, Modpolish can further improve their quality after Medaka without false corrections.

Fig. 4. Comparison of Medaka and Modpolish for a Q scores and b mismatches on the R9.4 dataset; comparison of Medaka and Modpolish for c Q scores and d mismatches on the R10.4 dataset.

In the R10.4 (duplex mode) dataset (ref. 3), we compared the quality of genomes polished by Medaka and by Modpolish (downsampled to ~60×) (Fig. 4c, Supplementary Table 15). In general, Modpolish made little or no improvement on the duplex dataset. For instance, Modpolish reduced the mismatches on the Bacillus genome only from 20 to 19 (Fig. 4d). The overall genome quality is so high that no differences can be seen (Q60). Modpolish also demonstrated only marginal improvement on a recently published simplex dataset (R10.4, kit 14, Dorado v0.1.1) (Supplementary Fig. 9). Therefore, genomes from ONT R10.4 flowcells, in particular in duplex mode, are not only of higher quality than those from R9.4 but also require nearly no further correction. On the other hand, Modpolish may be used to close the accuracy gap between simplex and duplex modes when projects aim for higher throughput.
