Chloranthus genome provides insights into the early diversification of angiosperms – Nature.com

Posted: November 28, 2021 at 9:53 pm

Genome sequencing, assembly, and annotation

Chloranthus spicatus has a genome size of 2.97Gb (gigabases) based on K-mer analysis (Supplementary Fig.2, Supplementary Data1); the individual sequenced had a heterozygosity rate of 0.99%, which is possibly associated with the obligate outcrossing system of this species31. Genomic DNA was sequenced using three different methods: 182Gb of Oxford Nanopore Technologies (ONT) long reads, 240Gb of shotgun short reads (BGIseq 2000), and 240Gb of Hi-C data.

The assembled genome was 2,964.14Mb with a contig N50 size of 4.59Mb, covering 99.7% of the genome size as estimated by K-mer analysis (Supplementary Data2,3). Assembled contigs were then clustered into 15 pseudochromosomes, covering 99.9% of the original 2,964.14Mb assembly, with a super-scaffold N50 of 191.37Mb. After performing the Hi-C validation, the genome showed high contiguity, completeness, and accuracy (Supplementary Fig.3) with the 15 pseudochromosomes corroborated by previous chromosome counts of 2n=30 (ref. 32). In all, 21,392 protein-coding genes were predicted using a combination of homology-based and transcriptome-based approaches. The proteome was estimated to be at least 96.8% complete based on BUSCO (benchmarking universal single-copy orthologs) assessment (Supplementary Data4).
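For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed from a list of contig (or scaffold) lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (in Mb), for illustration only.
toy_contigs = [8, 6, 5, 4, 2, 1]
print(n50(toy_contigs))
```

A larger N50 means that half of the assembly sits in longer pieces, which is why the jump from the contig N50 to the super-scaffold N50 is reported as evidence of contiguity.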

The results obtained by Tandem Repeats Finder were mapped to the predicted coding genes of C. spicatus to estimate the proportion of incorrectly detected paralogous genes (Supplementary Data5). In the C. spicatus genome, repetitive elements accounted for 70.09% of the genome assembly, of which 97.67% were annotated as transposable elements (TEs) (Supplementary Data5). Long terminal-repeat retrotransposons (LTRs) were the major class of TEs and accounted for 58.79% of the genome. Among the LTRs, the most abundant elements were Gypsy (68.03% of the LTRs), followed by Copia (19.01% of the LTRs) (Supplementary Data6). The time of insertion of LTRs in the genome of C. spicatus was estimated based on a peak substitution rate of 0.03 (Supplementary Fig.4). We assumed a synonymous substitution rate of 1.51 × 10⁻⁹ substitutions per site per year following two recent studies of the closely related magnoliid lineages Liriodendron and Chimonanthus, resulting in an LTR burst time of approximately 9.9Ma (see Methods).
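The reported burst time can be reproduced with a short calculation. The sketch below assumes the commonly used relation T = K / (2r), where K is the divergence between the two LTRs of an element and r is the substitution rate; the text does not spell out the formula, so this is an assumption, and the rate is taken as 1.51 × 10⁻⁹ per site per year. The function name is hypothetical.

```python
def ltr_insertion_time(divergence, rate_per_site_per_year):
    """Estimate insertion time in years as T = K / (2 * r).

    The factor of 2 reflects that both LTR copies accumulate
    substitutions independently after insertion (assumed model).
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return divergence / (2 * rate_per_site_per_year)


PEAK_DIVERGENCE = 0.03   # peak value reported in the text
RATE = 1.51e-9           # substitutions per site per year (as read from the text)

age_ma = ltr_insertion_time(PEAK_DIVERGENCE, RATE) / 1e6
print(f"{age_ma:.1f} Ma")  # prints 9.9 Ma
```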

Comparison of the gene and genome characteristics (e.g., genome size, gene size, exon and intron sizes) of C. spicatus and 17 other phylogenetically diverse flowering plants (Supplementary Data7) revealed that long genes and long introns were more prevalent in the genomes of Chloranthales and magnoliids compared to other angiosperms (Fig.2a, b; Supplementary Data8). As the presence of nonfunctional genes and variation in total gene numbers among different species would bias the statistics of gene characteristics, we selected 2,184 high-confidence orthologs from C. spicatus, six magnoliids, and two well-characterized angiosperm genomes, Arabidopsis thaliana and Oryza sativa (Supplementary Fig.5a, Supplementary Data9). Comparison of the lengths of the coding regions and introns revealed that the average coding region lengths in all nine plant genomes were similar (ranging from 1,533 to 1,557bp), whereas the lengths of introns varied greatly (ranging from 153 to 3,681bp; Supplementary Data9). Chloranthus spicatus (3,681bp) and the six magnoliid genomes displayed much longer introns (ranging from 1,270 to 2,390bp) than those of A. thaliana (153bp) and O. sativa (372bp), signifying that the presence of longer genes is due to the extension of the intron length rather than coding regions. We separated the 2,184 high-confidence orthologs into groups based on length: <5kb (short genes), 5–10kb, 10–20kb, and >20kb (long genes). Long genes (>20kb) in C. spicatus (876) were much greater in number than those in Oryza (2) and Arabidopsis (0) (Supplementary Data8,9).
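The grouping of orthologs into four length classes (<5, 5–10, 10–20, and >20 kb) can be sketched as follows. How genes falling exactly on a boundary were assigned is not stated in the text, so the boundary handling here is an assumption, and the gene lengths are made up for illustration.

```python
from collections import Counter


def length_class(gene_length_bp):
    """Assign a gene to one of four length classes.

    Boundary handling is an assumption: 5 kb and 10 kb go to the
    higher class, and exactly 20 kb stays in the 10-20 kb class.
    """
    kb = gene_length_bp / 1000
    if kb < 5:
        return "<5kb"
    if kb < 10:
        return "5-10kb"
    if kb <= 20:
        return "10-20kb"
    return ">20kb"


# Hypothetical gene lengths in bp, not data from the study.
toy_lengths = [1200, 4800, 7500, 15000, 26000, 41000]
class_counts = Counter(length_class(n) for n in toy_lengths)
print(class_counts)
```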

a Comparison of gene and genome characteristics (i.e., genome size, gene size, exon, and intron sizes) of C. spicatus and 17 other flowering plants. b Comparison of the lengths of the coding regions among nine representative plant genomes. c Collinearity patterns between genomic regions of Amborella, Vitis, and Chloranthus. The colored (red/grey) wedges highlight the major syntenic blocks shared among these species. d The number of synonymous substitutions per synonymous site (Ks) distributions confirming the occurrence of a whole-genome duplication (WGD) event in C. spicatus. Source data underlying Fig. 2a are provided as a Source Data file.

We found a significant correlation between intron length and genome size (R²=0.8869). The highly conserved average length and number of exons among the nine species compared further indicated that exon structure is mostly consistent among the angiosperms. The average length of introns was approximately 1.66kb, 2.87kb, and 3.35kb for Lauraceae, Magnoliaceae, and Chloranthus, respectively (Supplementary Data8).
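As an illustration of the statistic reported above, the sketch below computes the coefficient of determination for a simple linear relationship between two variables as the squared Pearson correlation. Whether the authors used exactly this model is an assumption, and the data points are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R^2 of a simple linear fit."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("each sample must vary")
    return sxy ** 2 / (sxx * syy)


# Hypothetical genome sizes (Gb) and mean intron lengths (kb).
genome_size = [0.12, 0.4, 0.9, 1.7, 3.0]
intron_len = [0.15, 0.4, 1.3, 2.4, 3.5]
print(round(r_squared(genome_size, intron_len), 4))
```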

LTR retrotransposons (LTR-RTs) represent a major fraction of plant genomes, particularly gymnosperms and magnoliids13. Thus, to understand the constitution of introns in C. spicatus, we looked for repeated elements. LTRs were widely detected in the long introns of C. spicatus and appear to be the major contributor to the long introns in this species. For instance, the gene AT1G04950.1 located on Chromosome 1: 1402606–1408184 encodes Transcription initiation factor TFIID subunit 6 in A. thaliana. The LTR length of this orthologous gene in C. spicatus (Cspi02386) was significantly longer than that in Lauraceae, Magnoliaceae, O. sativa, and A. thaliana (Supplementary Fig.5b).

We discovered 11,500 intact LTRs and classified them into two groups, Gene-20K LTR (LTRs distributed in genes >20kb length) and ALL LTR (LTRs distributed throughout the genome, Supplementary Fig.6). A similar model distribution of Gene-20K LTR and ALL LTR (Supplementary Figs.7, 8) suggested that the insertion timing of both LTR groups was the same. Further analyses of expression levels revealed that genes with short introns were more likely to exhibit low expression, while genes with long introns exhibited higher expression. However, when the intron length was larger than 40kb, the expression level subsequently declined in C. spicatus (Supplementary Fig.9).

Our investigation of collinearity and synteny patterns between genomic regions of Amborella trichopoda (sister to all other extant angiosperms), Vitis vinifera (sister to all other rosids), and C. spicatus showed highly conserved synteny among these three species (Fig.2c). In addition, this analysis showed clear structural evidence of a single ancient WGD in C. spicatus. The syntenic depth ratio between C. spicatus and A. trichopoda was found to be 1:2, which means that each A. trichopoda region could be matched to two genomic regions in the C. spicatus genome while the comparison of C. spicatus with the ancient hexaploid V. vinifera genome revealed a 2:3 syntenic depth ratio (Fig.2c).

To further investigate the extent of conservation of genome structure between C. spicatus and other angiosperms, we performed pairwise synteny comparisons with several species of magnoliids (Magnolia biondii, Liriodendron chinensis, Persea americana, Cinnamomum kanehirae, Litsea cubeba, Phoebe bournei) (Supplementary Data10). Our results clearly showed that C. spicatus shared a higher number (3,029; i.e., 62.7%) of syntenic blocks (at both the scaffold and chromosome level) with species in its sister clade, the magnoliids, than with Ceratophyllales (2,483, 52.5%), V. vinifera (2,275, 56.5%), or the monocot Oryza sativa (1,700, 45.3%) (Supplementary Fig.10, Supplementary Data10). Amborella trichopoda (1,150, 57%) shared the fewest syntenic blocks with C. spicatus among all the species used for comparative analysis (Supplementary Data10); overall, the number of shared syntenic blocks between these representative genomes generally coincided with phylogenetic relationships.

To further investigate the phylogenetic placement of the C. spicatus WGD, we compared the distribution of Ks values, the number of synonymous substitutions per synonymous site. The Ks distribution for C. spicatus paralogs showed an obvious peak at approximately Ks=0.9, and peaks at similar Ks values were identified for other species (Ascarina rubricaulis, Chloranthus japonicus, and Sarcandra glabra) of Chloranthales (Fig.2d); the coincidence of these Ks values suggests that an ancient WGD event may be shared among all extant members of this clade. Further, the Ks values for orthologs shared by C. spicatus and Phoebe bournei (Lauraceae; magnoliids) show a peak value slightly greater than that observed for paralogs in C. spicatus and other Chloranthales species, which suggests that the Chloranthales WGD occurred after the divergence of Chloranthales and magnoliids (Fig.2d). These observations suggest that the ancient WGD event we detected was specific to Chloranthales.

Although magnoliids also exhibit an ancient WGD event (Supplementary Data11a), this event was not shared with Chloranthales. The age of the Chloranthales WGD is similar to that of a number of ancient polyploidy events that occurred independently in several major clades of angiosperms: the gamma (γ) event (103.67–129.35Ma) in the common ancestor of core eudicots; the tau (τ) event (101.82–138.82Ma) during the early diversification of monocots; the lambda (λ) event (98.22–130.04Ma) during the early diversification of magnoliids; the pi (π) event (85.78–119.82Ma) in Nymphaeales; the kappa (κ) event (98.06–130.54Ma) in Chloranthales (this study); and an unnamed event specific to Ceratophyllales33 (Fig.1d). Although these WGD events occurred independently, many of the same stress-related genes were retained independently after these WGD events, including two heat shock transcription factors and Arabidopsis response regulators34. These genes also appear to be retained in Chloranthus (Supplementary Figs.11, 12).

To resolve the long-standing uncertainty regarding the phylogenetic position of Chloranthales and relationships among the five major lineages of Mesangiospermae, 257 single-copy nuclear (SCN) genes were identified using whole-genome sequences from C. spicatus and 17 other flowering plants (strict single-copy genes for each species without missing genes; see Methods for species). The aligned protein-coding regions were analyzed using coalescent and concatenation approaches. Both analyses yielded an identical and highly supported topology (bootstrap values of 100%) (Supplementary Fig.13) in which monocots were sister to all other mesangiosperms; Chloranthales appeared as the sister group to magnoliids, and Chloranthales + magnoliids together were sister to Ceratophyllales + eudicots (Fig.2a, Supplementary Fig.13). We also performed phylogenetic inference based on a 937-SCN gene data set with selection criteria allowing a maximum of three missing species. The phylogenetic results showed an identical topology to that of the 257-SCN gene data set, supporting Chloranthales as the sister to magnoliids (Supplementary Fig.13).

To avoid potential errors caused by sparse gene sampling, we extracted 2,329 low-copy nuclear (LCN) genes, allowing a maximum of three copies for each gene in each species. The phylogenetic trees were then similarly reconstructed by both coalescent and concatenation methods as described above. The resulting species trees were topologically identical to the phylogenetic findings as revealed above based on the 257-SCN and 937-SCN data sets (Supplementary Fig.13). Among the 2,329 LCN gene trees, 61% of the trees (454 out of 742 trees) placed Chloranthales as the sister lineage to magnoliids, with bootstrap support greater than 70% (type I, Fig.1c).

As poor taxon sampling may lead to topological errors, we added a large number of published genomes of the angiosperms and two transcriptomes of Chloranthales to increase our taxon sampling. We extracted 612 mostly single-copy orthologous genes following Yang et al.21 and generated a 218-species dataset. The phylogenetic results were congruent with the topologies based on analyses of the 257-SCN, 937-SCN, and 2,329-LCN data sets, supporting monocots and a clade of Chloranthales plus magnoliids as successive sister lineages to a clade of Ceratophyllales plus eudicots (Fig.1d, Supplementary Fig.14).

Phylogenetic analyses were also conducted based on chloroplast DNA sequence data. We selected 80 genes, following a recent study that analyzed 2,881 plastomes1, and obtained two data sets, with 18 species and 134 species, respectively. The resultant topologies using both chloroplast data sets agree with those from the four nuclear data sets in strongly supporting a sister relationship between Chloranthales and magnoliids (Supplementary Figs.15, 16).

Although the same pattern of phylogenetic relationships among the five major groups of Mesangiospermae was consistently recovered in analyses of all four nuclear data sets, phylogenetic incongruence was observed among nuclear gene trees. A major conflict was identified among single-gene trees in all four nuclear gene data sets (257-SCN, 937-SCN, 2,329-LCN, and 612-SCN) involving the placement of the Chloranthales-magnoliids clade relative to monocots and eudicots. We summarized the conflict among gene trees in the 2,329-LCN data set with regard to the proportions of trees supporting three different branching patterns for Chloranthales-magnoliids, monocots, and eudicots. The percentage of gene trees supporting the Chloranthales-magnoliids clade plus eudicots together forming a sister group to monocots (Type II) was higher than percentages for the other two topologies (gene trees with BS>70%: Type I, 30%; Type II, 53%; Type III, 17%; Fig.3a). It is notable that Type I and Type III, the two relationships conflicting with the most likely species tree, are not equal in frequency, suggesting gene tree incongruence patterns not expected under ILS alone (below).
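The reasoning in the last sentence rests on a standard expectation: under ILS alone, the two topologies that conflict with the species tree should occur at roughly equal frequency. A minimal sketch of how such an asymmetry could be checked is given below, using a chi-squared test with one degree of freedom. The text does not state that this particular test was run, and the counts are hypothetical.

```python
import math


def minor_topology_test(count_a, count_b):
    """Chi-squared test (1 d.f.) of equal frequency for two minor topologies.

    Returns (chi2, p_value). For one degree of freedom the chi-squared
    survival function equals erfc(sqrt(chi2 / 2)).
    """
    total = count_a + count_b
    if total == 0:
        raise ValueError("need at least one observed gene tree")
    expected = total / 2  # equal split expected under ILS alone
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of gene trees for the two discordant topologies.
chi2, p_value = minor_topology_test(300, 170)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

A small p-value indicates that the two discordant topologies are not equally frequent, which motivates testing for gene flow in addition to ILS.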

a A summary of the conflicts among gene trees in the 2,329-LCN data set with regard to the proportions of trees supporting three different branching patterns for Chloranthales-magnoliids, monocots, and eudicots. b Gene tree incongruence between nuclear (2,329 LCN genes) and plastid (80 plastid genes) trees in a consensus DensiTree plot. c A consensus scenario showing ancient gene flow between monocots and eudicots, inferred by QuIBL, PhyloNet, and ABBA-BABA D-statistics. Source data underlying Fig. 3b are provided as a Source Data file.

Furthermore, gene tree discordances were also observed between chloroplast and nuclear gene trees. Phylogenetic analyses of these 18 and 134 flowering plants inferred from 80 concatenated plastid genes strongly supported the placement of the Chloranthales-magnoliids clade as sister to all other Mesangiospermae (Fig.3b and Supplementary Figs.15, 16), which is consistent with the Type I nuclear topology (Fig.3a).

A nonrandom incongruence pattern was observed among different topology types: constituent species of monocots (3 spp.), eudicots (4 spp.), and magnoliids (7 spp.) were assigned to a clade. For each topology type, the majority of genes supported the monophyly of C. spicatus and seven species of magnoliids (Type I: 168/316=53.2%; Type II: 297/496=59.9%; Type III: 122/203=60.1%). We also mapped genes that caused conflict on the chromosomes. Genes that supported both Type I and Type II topologies were found to be evenly distributed on the 15 chromosomes (Supplementary Fig.17). Chi-squared tests showed that gene numbers and locations on each chromosome do not differ significantly (Supplementary Data12, 13).

The observed gene tree incongruence between nuclear and chloroplast trees and among nuclear single-gene trees indicates the possibility of incomplete lineage sorting (ILS) and/or hybridization events during early angiosperm evolution. We first used QuIBL, an approach using branch length distributions across gene trees, to infer putative hybridization patterns35. In all, 100 runs of QuIBL were conducted using 500 randomly selected trees from the 2,329 LCN gene trees. Strong hybridization signals (rate >0.1) were identified in several pairs of major clades of angiosperms (Supplementary Figs.18, 19), including: (i) ancestor of eudicots and ancestor of monocots; (ii) ancestor of eudicots and C. spicatus; (iii) ancestor of the species pair Arabidopsis thaliana-Erythranthe guttata and Vitis vinifera; (iv) Erythranthe guttata and Ceratophyllum demersum. Strong signals of ILS were also detected between Lauraceae and Magnoliaceae (Supplementary Figs.18, 19). Among these events, cases (i) and (ii) can be explained as the causes of gene tree incongruence of the Chloranthales-magnoliids clade relative to monocots and eudicots.

A second analytical approach, PhyloNet, was used to further assess putative hybridization events in our phylogeny. Five network searches were carried out by allowing one to five reticulation events. The species network under the best model (AICs=50.78; BICs=30.52, Supplementary Data14) identified two hybridization events among major clades of angiosperms (Supplementary Fig.20a), supporting ancient hybridization between early members of Nymphaeales and monocots. Ancestral gene flow between monocots and eudicots (Supplementary Fig.20a) was additionally supported by results of QuIBL (Supplementary Fig.19). To test whether the PhyloNet results identified hybridization correctly, we repeated the PhyloNet analyses using coalescent trees simulated without hybridization under the ASTRAL species tree (Supplementary Fig.11). As expected, the species network under the best model (AICs=51.21; BICs=30.25, Supplementary Data14) detected no hybridization events among monocots, eudicots, and magnoliids (Supplementary Fig.20b), suggesting that the analysis using empirical gene trees was not susceptible to false positives.

The unequal frequencies of Type I and Type III topologies discordant with the species tree suggested that ILS alone may not explain the gene tree conflicts in this study; therefore, we also used the ABBA-BABA approach to explicitly model patterns of discordant genealogies. This analysis also inferred frequent hybridization signals (Supplementary Fig.21). Consistent with the other two methods employed, the hybridization event between monocots and eudicots was detected, with the largest absolute Z-value (14.6).
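For orientation, the ABBA-BABA D-statistic and its Z-value can be sketched as below. D is computed as (ABBA − BABA) / (ABBA + BABA), and the Z-value is obtained here with a delete-one block jackknife that treats blocks as equally weighted. The study's actual pipeline and settings are not described in this section, so the jackknife scheme, the function names, and the block counts are all assumptions for illustration.

```python
import math


def d_statistic(abba, baba):
    """Patterson's D from counts of ABBA and BABA site patterns."""
    total = abba + baba
    if total == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / total


def jackknife_z(blocks):
    """Return (D, Z) from per-block (ABBA, BABA) counts.

    Uses a delete-one jackknife with equal block weights (assumption).
    """
    n = len(blocks)
    if n < 2:
        raise ValueError("need at least two blocks")
    total_abba = sum(a for a, _ in blocks)
    total_baba = sum(b for _, b in blocks)
    d_all = d_statistic(total_abba, total_baba)
    # Recompute D with each block left out in turn.
    leave_one_out = [
        d_statistic(total_abba - a, total_baba - b) for a, b in blocks
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    if variance == 0:
        raise ValueError("zero jackknife variance")
    return d_all, d_all / math.sqrt(variance)


# Hypothetical per-block site-pattern counts.
toy_blocks = [(12, 5), (9, 6), (14, 4), (10, 7), (11, 5)]
d_value, z_value = jackknife_z(toy_blocks)
print(f"D = {d_value:.3f}, Z = {z_value:.2f}")
```

A D close to zero is expected under ILS alone, while a large |Z| indicates an excess of one discordant pattern, as reported for the monocot-eudicot comparison.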

In summary, all three methods used to investigate hybridization (QuIBL, PhyloNet, and ABBA-BABA D-statistics) were unanimous in suggesting ancient gene flow between monocots and eudicots, although with variation among methods in the number of hybridization events and any additional lineages involved in hybridization. A consensus scenario is presented (Fig.3c) showing ancient gene flow between monocots and eudicots.

Terpene synthases (TPSs) are key enzymes that control the production of terpenoids, crucial defense compounds in plants36. To explore the evolution of the TPS family in Magnolia and Chloranthales, as well as to garner a better understanding of terpene evolution in angiosperms, we searched for candidate TPSs in C. spicatus and 17 other flowering plants (the same taxon sampling as in the comparative genomics analyses). Chloranthus spicatus encodes 73 TPSs (Supplementary Data15), similar to V. vinifera (75) and A. coerulea (74), while C. kanehirae exhibited the largest number (95) of TPSs. Particularly, compared to the ANA grade, there was higher diversity in almost all of the magnoliid species and C. spicatus (Supplementary Data16). Furthermore, according to the subfamily classification of TPS genes, TPSs were divided into 6 clades: TPS-a, b, c, e, f, and g (Fig.4b). In Amborella and members of Nymphaeaceae (Euryale ferox and Nymphaea colorata), TPS-a was absent (Supplementary Data16). Furthermore, when we performed GO enrichment analyses using the shared genes between magnoliids and Chloranthales, the genes related to terpene synthase activity (GO:0010333) exhibited a low P-value, indicating that terpene synthase activity was the most enriched of all GO categories (Supplementary Data17). Moreover, our gene family analysis indicates that the TPS-a and TPS-b gene clades expanded remarkably in magnoliids and Chloranthales compared to all other angiosperm clades (Supplementary Data16, Fig.4b); these gene clades primarily consist of angiosperm-specific sesquiterpene and monoterpene synthases, respectively. Several unique Chloranthus-specific sesquiterpenoids, including chlorahololides A, chloranthalactone A, and chlotrichenes A and B with bioactive potential, have been isolated and chemically synthesized in the lab37,38,39.

a A total of 44 genes related to the 2-C-methyl-D-erythritol 4-phosphate (MEP) pathway and the mevalonate (MVA) pathway were identified in C. spicatus (left panel). HMGR and DXS exhibited the highest copy numbers in the MVA and MEP pathways, respectively. Differentially expressed genes among seven representative tissues of C. spicatus involved in MEP and MVA pathways are shown in the right panel. b Identification of candidate terpene synthases (TPSs) in C. spicatus and subfamily classification revealed six major clades (TPS-a, b, c, e, f, and g). The gene family tree indicates that TPS-a and TPS-b gene clades are significantly expanded in magnoliids and Chloranthales. c Contraction of R genes in Chloranthales. The nucleotide-binding site-leucine-rich repeat (NBS-LRR) genes were divided into three classes: TIR-NBS-LRR (TNL), CC-NBS-LRR (CNL), and RPW8-NBS-LRR (RNL). In all, 3,518 NBS genes were identified in 28 angiosperm species. * indicates the data were obtained from a previous study40.

To understand the origin of the TPS paralogs, we compared the numbers of genes in each duplication type among species of magnoliids and Chloranthus (Supplementary Data18). The results showed that tandem (23, 33.3%), WGD (18, 26.1%), and transposition (21, 30.4%) duplication events contributed to the expansion of TPSs in C. spicatus, with only a few proximal repeats (7, 10.1%). The 73 CsTPS genes are not evenly distributed across the 15 chromosomes, with Chr2 having the highest concentration of TPS genes. Tandem repeats are mainly present on Chr2 and Chr7 (5 and 6 tandem repeats, respectively), but are also present on chromosomes 4, 5, 9, 14, and 15. We hypothesize that WGD contributed to TPS expansion as well; for instance, it may account for the higher copy number represented by the pairs CsTPS03 and CsTPS33, and CsTPS05 and CsTPS19 (Supplementary Fig.22, Supplementary Data16, 18).

Next, we investigated the genes involved in the production of non-volatile isoprenoids via the 2-C-methyl-D-erythritol 4-phosphate (MEP) pathway and the mevalonate (MVA) pathway and identified 44 genes in C. spicatus that may be involved in these pathways (Supplementary Data19, 20). There were multiple copies of the genes encoding enzymes related to the MVA pathway, and the number was approximately double that detected for genes in the MEP pathway. The gene encoding the HMGR enzyme (3-hydroxy-3-methylglutaryl-CoA reductase) displayed the highest number of gene copies (12), followed by AACT (acetoacetyl-CoA thiolase) (6 copies). In the MEP pathway, except for DXS (1-deoxy-D-xylulose-5-phosphate synthase), DXR (1-deoxy-D-xylulose 5-phosphate reductoisomerase), and GGPS (geranylgeranyl pyrophosphate synthase), each remaining enzyme had only one corresponding gene copy. In addition, to further validate this observation, a differential gene expression (DE) analysis was also performed using the transcriptome data from different plant parts (stamen, carpel, and peduncle) (Fig.4a). Regardless of the number of gene copies encoding the enzymes of these pathways, at least one gene copy for each enzyme was highly expressed in each tissue. However, for the multiple-copy genes, a few genes were responsible for most of the expression, while the remaining copies were weakly expressed. Altogether, the analyses of expansion and differential expression of TPSs suggest that the appearance of multiple-copy genes in the MVA pathway could be related to the expansion of the TPS-a subfamily, which is probably responsible for the production of sesquiterpenes in Chloranthales.

Nucleotide-binding site-leucine-rich repeat (NBS-LRR, NBS for short) genes encompass more than 80% of the characterized R genes40. We identified 3,518 NBS genes in 28 angiosperm species and divided them into three classes: the Toll/interleukin-1 receptor class TIR-NBS-LRR (TNL), the N-terminal coiled-coil motif class CC-NBS-LRR (CNL), and the resistance to powdery mildew 8 class RPW8-NBS-LRR (RNL)40 (Supplementary Data21). The gene copy number in each NBS class showed considerable variation among the analyzed species (Fig.4c). The highest number of TNL genes was found in M. truncatula (of the 28 species examined), while the highest numbers of CNLs and RNLs were in S. tuberosum and G. max, respectively. Moreover, M. truncatula, G. max, and S. tuberosum contained more R genes than the other angiosperms examined; Chloranthales and magnoliids contained far fewer R genes. TNL and RNL were absent from Chloranthus and the magnoliids (as in the monocot species, O. sativa), and only 19 and 13 CNLs were present in C. spicatus and Magnolia biondii, respectively (Fig.4c, Supplementary Data21). In the species having both TNL and CNL genes, the CNLs are generally more common than the TNLs, with the exception of A. thaliana, V. vinifera, A. trichopoda, E. ferox, and N. colorata.
