Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package … – Nature.com

Posted: December 14, 2023 at 3:37 am

Description of datasets

In this study we, utilized five independent cohorts including four cancer- and one Alzheimer-related datasets. Gene expression profiling was done using RNA-seq and DNA methylation data were obtained using the Illumina Infinium HumanMethylation450 BeadChip, measuring DNA methylation levels (beta values) on about 450,000 genomic loci.

The TCGA cohorts were obtained using the TCGAbiolinks package50 (Version2.24.3). TCGA-LUSC22 and TCGA-LUAD23 had clinical and genomic data from 589 and 592 patients, respectively (Supplementary TableS2). Information on the pathological stages of the tumors was included in both datasets. We used this information to stratify the patients into distinct risk groups and compared the resulting stratification with clusters obtained from our approach.

TCGA-LIHC24 was provided by a comprehensive study that included 436 cases with clinical information available in the data. We used the Ishak fibrosis score51 and alpha-fetoprotein (AFP) level52,53,54,55,56 to stratify patients into low-, intermediate-, and high-risk groups. The employed score is described later in this section.

TCGA-L AML was provided by a thorough genomic and epigenomic study on 200 adult cases with AML25. The risk groups were defined based on cytogenetic abnormalities25,57.

In addition, we used the ROSMAP cohort provided by the longitudinal cohort studies of aging and dementia. We downloaded the ROSMAP dataset from accelerating Medicines Partnership- AD58 with Synapse IDs syn3388564 (bulk RNA-seq) and syn5850422 (DNA methylation), using the synapser (https://r-docs.synapse.org/articles/synapser.html) R package (Version0.6.61) and a custom R scripts (Version3.6.1)59.

In the TCGA cohorts, events were defined by patients death and the time to an event referred to the duration from the initial diagnosis to death time or the last follow-up. In the ROSMAP cohort, the event was clinical diagnosis of any dementia including mild cognitive impairment with or without other cognitive conditions, Alzheimers dementia with or without other cognitive conditions, and other primary causes of dementia without clinical evidence of Alzheimers dementia. The time to an event in this context referred to the age at which the first dementiarelated diagnosis was made.

To enhance the power of our network, we included cases that have either a single type of data (i.e., gene expression or DNA methylation) or both data available. In the survival analysis, we included only patients whose gene expression, DNA methylation, and survival data were available (Supplementary TableS2).

The initial step in preprocessing involves normalizing the gene expression data. This is accomplished via a logarithmic transformation in based 10 to stabilize the variance and make the data more amenable to following analyses. Because logarithm of zero is not defined, a small offset is added to the expression levels prior to applying this transformation. iNETgrate further preprocesses data in two steps: cleaning and filtering. The former step involved cleaning DNA methylation and clinical data using the wrapper function cleanAllData(). Loci with more than (50%) missing beta values were removed, while loci with less than (50%) missing values were imputed. The imputation was performed by replacing each missing value with the mean of the beta values for the corresponding locus (preprocessDnam()). The clinical data was subsequently cleaned by removing cases with missing survival time and status (prepareSurvival()). The cleaned survival data had patient information including ID, events, time, and risk based on the clinical gold standard.

The second step in the preprocessing data was filtering out genes and loci that have a weak absolute Pearson correlation with survival time and vital status. This was performed by calling electGenes() inside the cleanAllData() wrapper function. In this study, we set the absolute correlation coefficient cutoffs to 0.2 in all TCGA datasets and 0.1 in the ROSMAP dataset.

Every gene and locus that met the quality control criteria was retained for the subsequent steps. In addition, we used computeUnion() to include corresponding loci of the selected genes and corresponding genes of the selected loci in the next steps of analysis.

In iNETgrate, every node represents a gene with two features (i.e., gene expression and DNA methylation values). Therefore, we needed to calculate the DNA methylation value for each gene using computEigenloci(). This function calculated a weighted average of loci levels for their corresponding gene in the following way. When the number of loci corresponding to a gene was less than six, the first principal component (i.e., eigenloci) was calculated directly by taking a weighted average of beta values using PCA. This was the case for almost (95%) of loci in our datasets (Supplementary Fig.S1).

For the remaining (5%) of cases, in which the number of loci representing a gene was six or more, we used findCore() to determine the most connected cluster of loci for each gene. Specifically, a graph was constructed for each gene using the igraph package (Version 1.5.0). In this graph, each locus is represented by a node. We used a fast greedy algorithm60 to calculate the pairwise correlation between loci and detected communities (i.e., clusters) in the graph. Within each community, the average pairwise correlation was computed. The community with the highest average pairwise correlation was identified as a dense subset of highly co-methylated loci in the graph, and the eigenloci value was then computed based on this subset.

We constructed a network in which nodes represent genes and edges are weighted based on the absolute correlation of gene expression and DNA methylation levels for each pair of genes. This was achieved using the makeNetwork() function. The weight of the edges between genes (g_i) and (g_j) was calculated using the following equation:

$$begin{aligned} mathscr {W}(g_i,g_j)=(1-mu )|{{,textrm{cor},}}_E(g_i,g_j)|+ mu |{{,textrm{cor},}}_M(g_i,g_j)|, end{aligned}$$

(1)

Here, (mathscr {W}(g_i,g_j)) describes the integrated similarity between genes (g_i) and (g_j). The term (|{{,textrm{cor},}}_E(g_i,g_j)|) represents the absolute value of the Pearson correlation between the gene expression levels of genes (g_i) and (g_j). Similarly, (|{{,textrm{cor},}}_M(g_i,g_j)|) represents the absolute value of the Pearson correlation between the DNA methylation levels of these two genes. The hyperparameter (mu ) is an integrative factor controlling the relative contributions of gene expression and DNA methylation data in the network. When (mu =0), the network is based solely on gene expression data. Increasing the value of (mu ) emphasizes the DNA methylation data in the model, whereas (mu =1) indicates that only DNA methylation data is used in calculating the edge weights (i.e., gene similarities).

Construction of the network and identification of the modules were done by the wrapper function makeNetwork(), which first uses the pickSoftTreshold() function (RsquaredCut=0.75) from the weighted gene co-expression network analysis20(WGCNA) package (Version 1.72.1) to determine the optimal soft-thresholding power for our integrated network. Then, the blockwiseModules() function (with minModuleSize=5, the absolute value of Pearson correlation, and the default values for the rest of parameters) is utilized to execute a hierarchical clustering approach. This leads to identification of modules, where each module is a group of genes that exhibit similar patterns of expression and DNA methylation. Additionally, module zero is designed to contain outlier genes that cannot be confidently assigned to any module due to their weak or negligible correlation with other genes.

We employed PCA to compute an eigengene for every module (computEgengenes()). In order to balance the contribution of high-risk and low-risk groups, the gene expression and DNA methylation data were oversampled. Intermediate-risk cases were not included in the PCA. An eigengene is computed from a weighted average of gene expression levels ((E^e)), DNA methylation levels ((E^m)), or both ((E^{em})), using the following equations:

$$begin{aligned} E^e = alpha ^e_{_1} g^e_{_1} + alpha ^e_{_2} g^e_{_2} + cdots + alpha ^e_{_n} g^e_{_n}, end{aligned}$$

(2)

$$begin{aligned} E^m = alpha ^m_{_1} g^m_{_1} + alpha ^m_{_2} g^m_{_2} + cdots + alpha ^m_{_n} g^m_{_n}, end{aligned}$$

(3)

$$begin{aligned} E^{em} = (1-mu ) E^e + mu E^m. end{aligned}$$

(4)

Here, n is the number of genes in the module, (g^e_{_i}) is the expression level of gene i, and (g^m_{_i}) is the methylation level corresponding to gene i (i.e., eigenloci), while (alpha ^e_{_n}) and (alpha ^m_{_n}) are the corresponding weights. These weights are computed using PCA ensuring maximum variance and minimum loss of biological information. The eigengene levels are then inferred for the intermediate-risk group using the same weights obtained from PCA. It should be emphasized that regardless of which eigengenes are used, our network and the corresponding modules are consistently constructed based on both gene expression and DNA methylation data and they depend on the (mu ) hyperparameter. The resulting eigengenes are robust features, carrying useful biological information, which can be leveraged in classification, clustering, and other downstream analyses including survival analysis.

To identify the optimal subset of modules for precise prognostication of risk groups, we conducted a two-step survival analysis using analyzeSurvival(). In the first step, we performed a penalized Cox regression analysis using the least absolute shrinkage and selection operator (lasso) penalty29,30 from the glmnet R package61 (Version 4.1.7). This analysis identified the three modules that were most associated with the survival data. Second, we fitted an AFT model31 to each combination of the top three modules and determined the optimal combination that leads to the smallest p-value. p-values were based on a log-rank test with a null hypothesis that there is no difference between the two high- and low-risk groups62.

To categorize the risk groups, iNETgrate uses findAliveCutoff() that searches for a cutoff on the AFT predictions such that the difference between high- vs. low-risk groups is optimized. More specifically, for each risk group, the function iterates over all possible cutoff values leading to a recall of more than a given threshold (i.e., for low-risk: minRecall=0.2, for high-risk: minRecall=0.1 in ROSMAP and 0.05 in other datasets) and selects the cutoff value that maximizes precision.

To ensure the reliability of our integrative approach, we performed a comparative analysis by benchmarking our results against alternative methodologies including a well-known patient similarity network called SNFtool. We also compared our results vs. risk classification according to the clinical gold standards based on the intrinsic nature of the disease in each cohort.

The SNFtool first computes a similarity matrix for each data type (i.e., gene expression and DNA methylation). That is, using each data type independently, a network is constructed where nodes are patients and weights of the edges represent similarity between patients computed based on correlation. The networks (similarity matrices) are then fused to create a consensus network representing the overall similarity between patients across different data types. The resulting patient similarity network is then used to cluster patients into subgroups. We noted that the SNFtool faced some limitations in using all the DNA methylation loci due to memory exhaustion while computing the similarity matrices. We tackled this issue by filtering out loci with a relatively low variation characterized by a standard deviation of less than 0.1. Determining the appropriate cutoff for a given dataset is subjective and challenging for SNFtool users.

In lung cohorts (LUSC and LUAD), we evaluated the risk groups based on the tumor stage. Specifically, we classified stages I,IA,IB,II, and IIA as the low-risk group, stages IIIB and IV as the high-risk group, and the remaining stages as the intermediate-risk group. In the LIHC cohort, we considered a case high-risk if the AFP level was greater than 500 or the Ishak fibrosis score was six. In contrast, patients were considered low-risk if their AFP levels were smaller than 250 and their Ishak fibrosis scores were 0, 1, or 2. The remaining cases were considered intermediate-risk. In the LAML cohort, we utilized the classification system available in the clinical data that categorized cases based on cytogenetic criteria into three groups of favorable (low-risk), intermediate, and poor (high-risk). We utilized the Braak score63 to stratify the ROSMAP cohort into three risk groups. Cases with a Braak score of 0, 1, or 2 were considered low-risk, those with a Braak score of 5 or 6 were classified high-risk, while the remaining cases were grouped as intermediate-risk.

See the rest here:
Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package ... - Nature.com

Related Posts