Genome-centric analysis of short and long read metagenomes reveals uncharacterized microbiome diversity in Southeast Asians – Nature.com

Posted: October 15, 2022 at 5:22 pm

Subject recruitment

Subjects for this cross-sectional study were recruited based on recall from a community-based multi-ethnic prospective cohort27 that is part of the Singapore Population Health Studies project (SPHS -formerly Singapore Consortium of Cohort Studies). This subset included 109 subjects who were 48 to 76 years old with 65 males and 44 females (Supplementary Data1). Subjects in SPHS were recruited to participate in the National Health Survey, where subjects were selected at random using age- and gender- stratified sampling to obtain a representative sample set of residents in the country. During the period of recruitment from April 16th, 2008 to September 20th, 2018, subjects did not have any pre-existing major health conditions (cardiovascular disease, mental illness, diabetes, stroke, renal failure, hypertension and cancer) based on self-reporting27. The ethnicity of each subject was confirmed verbally so that all four grandparents of the subject belonged to the same ethnic group. As such, we do not anticipate that any self-selection bias was introduced. A separate comparison of baseline clinical measurements was performed, including age-adjusted BMI and HbA1c, against the rest of the subjects in the larger ethnicity-specific cohorts within Singapore Population Health Studies to ensure that the sampling for the initial cohort conformed to population norms. Informed consent was obtained from all participants. Each subject was given 60 Singapore Dollars for their participation in this study. All associated protocols for this study were approved by the National University of Singapore Institutional Review Board (IRB reference number H-17-026) on May 9th, 2017 and renewed until May 31st, 2021.

Fecal samples were collected from healthy subjects using the BioCollectorTM kit (The BioCollective, Colorado, USA). Samples were double-bagged and transferred to a polystyrene box, together with a pre-chilled ice-pack (20C). The polystyrene box was transferred to a cardboard box and later collected from the participants home within the same day. Samples brought to the Temasek Life Sciences laboratory were stored into an anaerobic chamber (atmosphere of N2 (75%), CO2 (20%), and H2 (5%)). Fecal samples were homogenized and subsamples transferred into sterile 2mL centrifuge tubes.

Genomic DNA was extracted from fecal material (0.25g wet weight) using the QIAamp Power Fecal Pro DNA kit (QIAGEN GmbH, Cat. No. 51804) and was quantified using Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific, Cat. No. Q32853). Integrity of the extracted DNA was verified using 0.5% agarose gel electrophoresis.

Metagenomic libraries were prepared with a standard DNA input of 50ng across all samples, using NEBNext Ultra II FS DNA Library Prep Kit for Illumina (New England Biolabs, Cat. No. E7805), according to the manufacturers instructions. The reaction volumes were, however, scaled to a quarter of the recommended volumes for cost effectiveness. Barcoding and enrichment of libraries was carried out using NEBNext Multiplex Oligos for Illumina (96 Unique Dual Index Primer Pairs; New England Biolabs, Cat. No. E6440). Paired-end sequencing (2151bp reads) was carried out on the Illumina HiSeq4K platform with a minimum and average depth per sample of 2.4Gb and 9.4Gb respectively.

Purity and integrity of DNA was assessed and ensured to fall within recommended ranges before library preparation. To preserve the integrity of DNA, the shearing step was omitted and DNA was used directly for DNA repair and end-prep. Single-plex libraries were prepared using 1D sequencing kit (Oxford Nanopore Technologies, SQK-LSK108 or SQK-LSK109) according to the 1D Genomic DNA by ligation protocol. For samples that were multiplexed (12-plex), the native barcoding kit (Oxford Nanopore Technologies, EXP-NBD103 or EXP-NBD104 and EXP-NBD114) was used and libraries were prepared according to the Native barcoding genomic DNA protocol. Both native barcode ligation and adapter ligation steps were extended to 30min instead of 10min. Single-plex samples were sequenced on either the MinION or GridION machine with either FLO-MIN106D or MIN106 revD flowcells. Multiplex samples were sequenced on the PromethION machine with FLO-PRO002 flowcells. Raw reads were basecalled with the latest version of the basecaller available at the point of sequencing (Guppy v3.0.4 to v3.2.6). Basecalled nanopore reads were demultiplexed and filtered for adapters with qcat (v1.1.0 https://github.com/nanoporetech/qcat). The minimum and average sample depth was 1.2 and 4.7Gb respectively. Number of reads ranged from 300,000 to 3.4 million (average=1.4 million).

Hi-C libraries were generated using Phase Genomics ProxiMeta kit (version 3.0), based on the standard protocol. Briefly, 500mg fecal material was crosslinked for 15min at room temperature with end-over-end mixing in 1mL of ProxiMeta crosslinking solution. Once crosslinking reaction was terminated, quenched fecal material was rinsed. Sample was resuspended and a low-speed spin was used to clear large debris. Chromatin was bound to SPRI beads and incubated for 1h with 150L of ProxiMeta fragmentation buffer and 11L of ProxiMeta fragmentation enzyme. Once washed, beads were resuspended with 100L of ProxiMeta Ligation Buffer supplemented with 5L of Proximity ligation enzyme and incubated for 4h. After reversing crosslinks, the free DNA was purified with SPRI beads and Hi-C junctions were bound to streptavidin beads and washed to remove unbound DNA. Washed beads were used to prepare paired-end deep sequencing libraries using ProxiMeta Library preparation reagents. Paired-end sequencing (2151bp reads) was carried out on the Illumina HiSeq4K platform. The minimum and average sample depth was 2.3 and 24.5Gb respectively.

Sequencing costs can vary substantially across sequencing centers and countries. Here we provide an estimate based on costs at the Genome Institute of Singapore in November 2021. Based on prices for library preparation kits as described in this manuscript, we estimate that Illumina library preparation costs ~US$50/sample and an Illumina HiSeq sequencing lane costs ~US$1000 with approximate throughput of >350 million paired-end reads (2151bp; >100Gbp). Considering that the average Illumina sequencing depth per sample in this study is ~10Gb, 10 samples can be multiplexed in a single lane, leading to the overall cost per sample of ~US$150. For ONT sequencing, we estimate that with an approximate flow-cell price of US$500 producing ~30Gbp of sequencing data, 5 samples can be multiplexed to obtain the average throughput in this study (~6Gbp). With ONT multiplexed library preparation costs of ~US$50/sample, we estimate that overall ONT costs are also ~US$150/sample. Metagenomic assembly of Illumina and Hybrid datasets with MEGAHIT and OPERA-MS, respectively, typically took less than 3h on an AWS C5 instance with 8 CPUs. Using as reference an AWS C5 instance price of 30 cents an hour for 8 CPUs, this translated to a computational cost of

Illumina and ONT read statistics were generated with Fastq-Scan (v0.4.1, https://github.com/rpetit3/fastq-scan) and NanoStat53 (v1.4.0), respectively. To assess taxonomic concordance, Illumina and ONT reads were classified with Kraken254 (v2.1.1, UHGG database13) and relative abundances were estimated with Bracken55 (v2.6.1) at the species level (option -l R7) to compute Pearson correlation coefficients per sample.

Illumina reads were assembled using MEGAHIT8 (v1.04, default parameters) and hybrid metagenomic assemblies were generated with Illumina and ONT data using OPERA-MS25 (v0.9.0, --polish). Contigs were binned with MetaBAT210 (v2.12.1, default parameters). Hi-C binning was provided by Phase Genomics using its internal pipeline with MetaBAT results for hybrid assemblies as a starting point. Assembly bins were evaluated based on MIMAG standards28, with contamination, completeness and N50 values determined with CheckM56 (v1.04), and non-coding RNA annotations from barrnap (https://github.com/tseemann/barrnap) (v0.9) and tRNAscan-SE57 (v2.0.5, default parameters). Assembly bins with contamination <10% and completeness >50% were designated as medium quality MAGs, those with contamination <5% and completeness >90% as near complete MAGs, and additionally near complete MAGs with complete 5S, 16S, and 23S rRNA genes and at least 18 unique tRNA genes were classified as high quality MAGs. All other bins were classified as low quality and were removed from further analyses. In total, 4497 medium quality, near complete and high quality MAGs were designated as being part of the SPMP database. Hybrid and short-reads assembly based MAGs were further assessed for chimerism with GUNC58 (v1.0.4, detailed output). Coding sequence lengths obtained from Prodigal59 (v2.6.3) calls were compared between the two datasets to assess the potential impact of long read indel errors on gene annotation. Concordant with prior work showing that hybrid metagenomic assemblies can have high base-pair accuracy25, we also noted that SPMP MAGs independently assembled from distinct individual gut metagenomes could exhibit high average nucleotide identity (>99.99%, consistent with Q40 quality).

Representative MAGs for SLCs were used to create a custom Kraken60 (v2.1.1) database (https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases) and relative abundances for SLCs were estimated for each sample using Bracken55 (v2.6.0, default parameters). Rarefaction analysis for estimating overall species diversity was done using the R package iNext61 (v2.1.7, q=0, datatype=incidence_raw and endpoint=300), based on converting SLC relative abundance values from Bracken into presence-absence values at a threshold of 0.05%.

Genus-level abundances for each sample were provided as input for R package MaasLin236 (v1.4.0) along with sample metadata (age, sex and ethnicity), and significant associations were determined by combining 3 MaasLin2 runs with a compound Poisson linear model.

Metagenomic reads were mapped (--secondary=no) against reference databases indexed with minimap262 (v2.24-r1122, -I 24G; SPMP strain-level genomes and UHGG species-level representatives). Alignments were filtered at the strain-level with bamtools (v2.5.2, -tag NM:<2 -length >99) and unique reads were extracted based on samtools (v1.15.1) view results.

To further evaluate the utility of SPMP genomes relative to the UHGG database for read mapping at the strain-level, we created databases with similar number of strains from both collections. Reference indexing and mapping were done in a similar fashion as described before. Alignments were filtered with pysam (v0.19.1) (read coverage 90%, identity 99%), and reads were classified at the species-level with Kraken (v2.1.1, RefSeq bacteria database). Specifically, we identified 21 species with many strain genomes in UHGG or SPMP (20) and having enough reads (>10 coverage) in at least 3 samples in an independent study of Singaporean gut metagenomes35. Illumina reads were mapped (minimap2, default parameters) independently to strain genomes for each species. Kraken2 classification (standard database) was used to assess if mapped reads came from the right species, and to calculate precision, sensitivity and F1 scores. We noted that median F1 scores were better using SPMP compared to UHGG for 17 out of 21 species. Overall, SPMP provided significantly better mapping performance (F1 score) relative to UHGG for 12 species (Wilcoxon p<0.05). The converse, i.e., significant improvements with UHGG relative to SPMP, were not observed for any species. Improvements in F1 scores were driven by better sensitivity in SPMP vs UHGG for abundant gut bacterial species such Prevotella copri and Alistipes onderdonkii. While median precision scores using SPMP and UHGG were similar (0.98 vs 0.99 for P. copri; 0.98 vs 0.97 for A. onderdonkii), sensitivity was notably higher in SPMP vs UHGG (0.96 vs 0.90 for P. copri; 0.99 vs 0.90 for A. onderdonkii).

The SPMP database was compared to the GTDB database2 (release 95) using GTDBtks63 (v1.4.1) ani_rep command with default arguments, which leverages Mash64 (v2.3) to provide pairwise genome-wide similarity values between all query MAGs and GTDB sequences. Only pairs with Mash distance 0.05 were retained and used to define the best match for each SPMP MAG based on minimum Mash distance. GTDB matches were classified based on their metadata as being uncultivated (derived from environmental sample or derived from metagenome) or based on isolate strains. Both N50 values and MIMAG classifications were extracted from GTDB metadata. MAGs were placed into a phylogenetic tree using GTDB_TK (v1.4.1) with classify_wf (default options), based on pplacer_taxonomy values. To assess novelty in light of the latest human gut metagenome database, we further compared our MAGs to the 5414 representative genomes from the Human Reference Gut Microbiome catalog (HRGM)22 with a similar Mash analysis (Supplementary Data6).

MAGs were clustered at the species (95%) and strain-level (99%) based on average nucleotide identity estimates (ANI; using Mash with sketch size of 10k and k-mer size of 21bp) with agglomerative clustering (sklearn v0.23.2, AgglomerativeClustering function, options: linkage=single, n_clusters=None, compute_full_tree=True, affinity=precomputed). For each cluster, representative MAGs were defined using the highest eigen centrality value based on a weighted network graph produced by networkx (v2.5; eigenvector_centrality function). Strain-level clustering was done jointly with all species-level matches from the UHGG database (v1.0, ANI threshold of 95%). Phylogenetic analysis at the strain-level was conducted using the biopython Phylo package65, based on pairwise distances generated with FastANI66 (v1.32). Phylogenetic trees were visualized using FigTree (tree.bio.ed.ac.uk/software/figtree).

SLCs were assigned putative species name and types based on comparisons with multiple databases, including GTDB, Pasolli et al.67 (SGB) and Almeida et al.13 (UHGG). SLCs types were defined as, (i) isolate: if GTDB match to an isolate was found (Mash distance 0.05), (ii) uncultivated: if a match to any database was found, but no isolates, (iii) novel: if no matches were found. SLCs were assigned putative species names based on a majority rule for MAGs in the cluster, with preference for GTDB ids (Supplementary Fig.9).

Biosynthetic gene clusters (BGCs) in the SPMP database were identified using antiSMASH68 (v5.1.2, --genefinding-tool prodigal-m --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees) and DeepBGC38 (v0.1.18, prodigal-meta-mode). BGCs with only one identified gene and with length <2kbp were removed for both sets of results. For antiSMASH this provided a set of 3,909 BGCs. DeepBGC results which overlapped with antiSMASH were removed if the genomic coordinates of both BGCs overlapped by 30% in either direction. DeepBGC candidates were further filtered for (i) being categorized with a known product class and (ii) containing at least one known biosynthetic pfam or TIGRFAM protein domain as defined by Cimermancic et al.69, providing an additional set of 23,175 BGCs.

All 27,084 BGCs (3909 from antiSMASH + 23,175 from DeepBGC) were first categorized into different product classes: ribosomally synthesized and post-translationally modified peptides (RiPPs), nonribosomal peptide synthetases (NRPs), polyketide synthases (PKS), saccharides and others based on the labels reported by each algorithm. We further unified the antiSMASH and DeepBGC product class labels to integrate both datasets (Supplementary Table1). A fraction of mined BGCs were labeled as hybrids because antiSMASH or DeepBGC associated them with two different product classes e.g., bacteriocin;T1PKS. The BGCs in each product class were grouped into gene cluster families (GCFs) by sequence similarity using BiG-SCAPE39 (v1.01, --include_singletons --mix --no_classify --cutoffs 0.3). A total of 16,055 GCFs were defined by this approach and for each GCF we took the smallest BGC member as a representative of the family. Gene cluster diagrams of BGCs were created using Clinker70.

BGCs in SPMP were classified as novel via a two-step approach. Firstly, BGC sequences were required to have <80% similarity to any existing sequence in the antiSMASH and MIBiG 2.071 databases using the clusterblast results from antiSMASH. Secondly, BGC annotations were compared to antiSMASH annotations from a comprehensive gut microbial genome collection (HRGM) using the standalone clusterblast software72 (v 1.1.0), to identify SPMP matches based on a 80% similarity threshold, similar to the approach described in Gallagher et al73.

Besides bacteriocins, BGC mining in the SPMP database also identified other classes of ribosomally synthesized and post-translationally modified peptides (RiPPs) such as lanthipeptides and lassopeptides (Supplementary Fig.15A), which can also possess antimicrobial properties. Antimicrobial activities of putative peptides encoded by novel RiPP BGCs in SPMP were predicted using an ensemble voting approach (Supplementary Fig.15B) with four different AMP prediction models: AMPscanner74 (v2, convolutional neural network), AmpGram75 (random forest model), AMPDiscover76 (based on quantitative sequence activity models) and ABPDiscover (https://biocom-ampdiscover.cicese.mx/). Peptides predicted by antiSMASH in these RiPP BGCs were translated and all amino acid sequences with a length greater than 10 but lesser than 200 were used as inputs into these four models. Peptides were classified as AMPs if they received votes from both AMPscanner and AmpGram, and at least one vote from either AMPDiscover or ABPDiscover, and if the corresponding RiPP BGCs contained a transporter protein. The performance of this ensemble approach was evaluated using 78 known AMP sequences and 78 scrambled non-AMP sequences taken from the AmpGram benchmark dataset75. For our evaluation dataset, we identified and removed all sequences that were found in the training sets of AMPscanner, AmpGram, AMPDiscover and ABPDiscover using seqkit77 (v0.11.0) and samtools faidx (v1.9). The percentage hydrophobicity and overall charge of selected peptide sequences was determined using the AMP calculator in the AMP database 3 (APD3; https://aps.unmc.edu/prediction).

Out of 107 RiPP BGCs that were not bacteriocins, 54 of them were predicted to also be AMPs. One of these was found to be a lanthipeptide (GCF459) in Dorea longicatena B (Supplementary Fig.15C), with no significant blastp matches to the NCBI nr database. This peptide sequence has a 32% hydrophobic amino acid composition and a net positive charge of +5, which could favor its insertion into the cell walls and membranes of its targets. Another novel AMP is a lassopeptide (GCF26) found in a Ruminococcus species (Supplementary Fig.15D), with similarly high proportion of hydrophobic amino acids (35%) and a slight net positive charge.

To associate BGC presence/absence patterns with microbial community structure, correlation analysis (Fastspar78 v1.0.0, parameters: --iterations 100 --exclude_iterations 20, p-values from 1000 bootstrap replicates and permutation testing) was done based on SLC abundance profiles across samples (species with medium abundance 0.1% filtered out). Correlations in the network were kept if they had an associated p-value <0.05.

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The Investigators were not blinded to allocation during experiments and outcome assessment.

Further information on research design is available in theNature Research Reporting Summary linked to this article.

See original here:
Genome-centric analysis of short and long read metagenomes reveals uncharacterized microbiome diversity in Southeast Asians - Nature.com

Related Posts