Genotyping, sequencing and analysis of 140,000 adults from Mexico … – Nature.com

Posted: October 16, 2023 at 6:42 am

Recruitment of study participants

The MCPS was established in the late 1990s following discussions between Mexican scientists at the National Autonomous University of Mexico (UNAM) and British scientists at the University of Oxford about how best to measure the changing health effects of tobacco in Mexico. These discussions evolved into a plan to establish a prospective cohort study that could investigate not only the health effects of tobacco but also those of many other factors (including factors measurable in the blood)1. Between 1998 and 2004, more than 100,000 women and 50,000 men 35 years of age or older (mean age 50 years) agreed to take part, were asked questions, had physical measurements taken, gave a blood sample and agreed to be tracked for cause-specific mortality. More women than men were recruited because the study visits were predominantly made during working hours, when women were more likely to be at home (although visits were extended into the early evenings and weekends to increase the proportion of men in the study).

Participants were recruited from randomly selected areas within two contiguous city districts (Coyoacán and Iztapalapa). These two districts have existed since the pre-Hispanic period and are geographically close to the ancient Aztec city of Tenochtitlan. Originally, Indigenous populations settled there, but over the centuries, the population dynamics have substantially changed. Many people from Spain, including the conqueror Hernán Cortés, resided in Coyoacán while the capital of New Spain was being built over the ruins of Tenochtitlan. The modern populations of Coyoacán and Iztapalapa derive largely from the development of urban settlements and migrations from the 1950s to the 1970s. Over this period, both districts, but particularly Iztapalapa, received large numbers of Indigenous migrants from the central (Nahuas, Otomíes and Purépechas), south (Mixtecos, Zapotecos and Mazatecos) and southeast (Chinantecos, Totonacas and Mayas) regions of the country.

At recruitment, a 10-ml venous EDTA blood sample was obtained from each participant and transferred to a central laboratory in a transport box chilled (4–10 °C) with ice packs. Samples were refrigerated overnight at 4 °C, then centrifuged (2,100g at 4 °C for 15 min) and separated the next morning. Plasma and buffy-coat samples were stored locally at −80 °C, then transported on dry ice to Oxford (United Kingdom) for long-term storage over liquid nitrogen. DNA was extracted from buffy coat at the UK Biocentre using Perkin Elmer Chemagic 360 systems and suspended in TE buffer. UV–VIS spectroscopy using Trinean DropSense96 was used to determine yield and quality, and samples were normalized to provide 2 µg DNA at 20 ng µl⁻¹ concentration (2% of samples provided a minimum of 1.5 µg DNA at 10 ng µl⁻¹ concentration) with a 260:280 nm ratio of >1.8 and a 260:230 nm ratio of 2.0–2.2.

Genomic DNA samples were transferred to the Regeneron Genetics Center from the UK Biocentre and stored in an automated sample biobank at −80 °C before sample preparation. DNA libraries were created by enzymatically shearing DNA to a mean fragment size of 200 bp, and a common Y-shaped adapter was ligated to all DNA libraries. Unique, asymmetric 10 bp barcodes were added to the DNA fragments during library amplification to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled before overnight exome capture with a slightly modified version of IDT's xGen v1 probe library; all samples were captured on the same lot of oligonucleotides. The captured DNA was PCR amplified and quantified by quantitative PCR. The multiplexed samples were pooled and then sequenced using 75 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of 146,068 samples were made available for processing. We were unable to process 2,628 samples, most of which failed QC during processing owing to low or no DNA being present. A total of 143,440 samples were sequenced. The average 20× coverage was 96.5%, and 98.7% of the samples were above 90%.

Of the 143,440 samples sequenced, 2,394 (1.7%) did not pass one or more of our QC metrics and were subsequently excluded. Criteria for exclusion were as follows: disagreement between genetically determined and reported sex (n=1,032); high rates of heterozygosity or contamination (VBID>5%) (n=249); low sequence coverage (less than 80% of targeted bases achieving 20× coverage) (n=29); genetically identified sample duplicates (n=1,062 total samples); WES variants discordant with the genotyping chip (n=8); uncertain linkage back to a study participant (n=259); and instrument issue at DNA extraction (n=6). The remaining 141,046 samples were then used to compile a project-level VCF (PVCF) for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained 9,950,580 variants.

Approximately 250 ng of total DNA was enzymatically sheared to a mean fragment size of 350 bp. Following ligation of a Y-shaped adapter, unique, asymmetric 10 bp barcodes were added to the DNA fragments with three cycles of PCR. Libraries were quantified by quantitative PCR, pooled and then sequenced using 150 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of 10,008 samples were sequenced. This included 200 mother–father–child trios and 3 more extended pedigrees. The rest of the samples were chosen to be unrelated to the third degree or closer and enriched for parents of nuclear families. The average mean coverage was 38.5×; 99% of samples had mean coverage >30×, and all samples were above 27×.

Of the 10,008 samples that were whole-genome sequenced, 58 (0.6%) did not pass one or more of our QC metrics and were subsequently excluded. Reasons for exclusion were as follows: disagreement between genetically determined and reported sex (n=16); high rates of heterozygosity or contamination (VBID>5%) (n=10); genetically identified sample duplicates (n=19 total samples); and uncertain linkage back to a study participant (n=14). The remaining 9,950 samples were then used to compile a PVCF for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained 158,464,363 variants.

The MCPS WES and WGS data were reference-aligned using the OQFE protocol35, which uses BWA MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. Single-sample variants were called using DeepVariant (v.0.10.0) with default WGS parameters or custom exome parameters35, generating a gVCF for each input OQFE CRAM file. These gVCFs were aggregated and joint-genotyped using GLnexus (v.1.3.1). All constituent steps of this protocol were executed using open-source software.

Similar to other recent large-scale sequencing efforts, we implemented a supervised machine-learning algorithm to discriminate between probable low-quality and high-quality variants8,12. In brief, we defined a set of positive control and negative control variants based on the following criteria: (1) concordance in genotype calls between array and exome-sequencing data; (2) transmitted singletons; (3) an external set of likely high-quality sites; and (4) an external set of likely low-quality sites. To define the external high-quality set, we first generated the intersection of variants that passed QC in both TOPMed Freeze 8 and gnomAD v.3.1 genomes. This set was additionally restricted to high-confidence SNPs from phase 1 of the 1000 Genomes project36 and gold-standard insertions and deletions from the 1000 Genomes project and a previous study37, both available through the GATK resource bundle (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle). To define the external low-quality set, we intersected gnomAD v.3.1 fail variants with TOPMed Freeze 8 Mendelian or duplicate discordant variants. Before model training, the control set of variants was binned by allele frequency and then randomly sampled such that an equal number of variants was retained in the positive and negative labels across each frequency bin. A support vector machine using a radial basis function kernel was then trained on up to 33 available site quality metrics, including, for example, the median value for allele balance in heterozygote calls and whether a variant was split from a multi-allelic site. We split the data into training (80%) and test (20%) sets. We performed a grid search with fivefold cross-validation on the training set to identify the hyperparameters that returned the highest accuracy during cross-validation, which were then applied to the test set to confirm accuracy.
This approach identified a total of 616,027 WES and 22,784,296 WGS variants as low quality (of which 161,707 and 104,452 were coding variants, respectively). We further applied a set of hard filters to exclude monomorphic variants, unresolved duplicates, variants with >10% missingness, variants with three or more Mendel errors (WGS only) and variants that failed Hardy–Weinberg equilibrium (HWE) with excess heterozygosity (HWE P<1×10⁻³⁰ and observed heterozygote count >1.5× the expected heterozygote count), which resulted in a dataset of 9,325,897 WES and 131,851,586 WGS variants (of which 4,037,949 and 1,460,499 were coding variants, respectively).
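The classifier described above can be sketched in scikit-learn; here, synthetic features stand in for the 33 site quality metrics, so the data, feature dimensions and parameter grid are illustrative rather than the study's actual configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for site quality metrics (e.g. allele balance):
# positive (high-quality) and negative (low-quality) control variants
# are drawn from shifted distributions.
n = 400
X_pos = rng.normal(loc=1.0, size=(n, 5))
X_neg = rng.normal(loc=-1.0, size=(n, 5))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# 80%/20% train/test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# RBF-kernel SVM with a grid search over hyperparameters,
# scored by fivefold cross-validation on the training set.
model = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    cv=5,
)
model.fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)  # confirm accuracy on the held-out set
```

In the study, predicted low-quality variants would then be removed before the hard filters are applied.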

Variants were annotated as previously described38. In brief, variants were annotated using the Ensembl Variant Effect Predictor, with the most severe consequence for each variant chosen across all protein-coding transcripts. In addition, we derived canonical transcript annotations based on a combination of MANE, APPRIS and Ensembl canonical tags. MANE annotation was given the highest priority, followed by APPRIS. When neither MANE nor APPRIS annotation tags were available for a gene, the canonical transcript definition of Ensembl was used. Gene regions were defined using Ensembl release 100. Variants annotated as stop gained, start lost, splice donor, splice acceptor, stop lost or frameshift, for which the allele of interest was not the ancestral allele, were considered predicted loss-of-function variants. Five annotation resources were used to assign deleteriousness to missense variants: SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT and MutationTaster. Missense variants were considered likely deleterious if predicted deleterious by all five algorithms, possibly deleterious if predicted deleterious by at least one algorithm and likely benign if not predicted deleterious by any algorithm.
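The missense classification rule can be written directly; this is a minimal sketch of the stated rule, with the predictor names as listed above:

```python
def classify_missense(preds):
    """Classify a missense variant from five deleteriousness predictors.

    preds: mapping of predictor name (SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR,
    LRT, MutationTaster) to True if that predictor calls the variant
    deleterious.
    """
    n_deleterious = sum(1 for v in preds.values() if v)
    if n_deleterious == 5:
        return "likely deleterious"    # deleterious by all five algorithms
    if n_deleterious >= 1:
        return "possibly deleterious"  # deleterious by at least one
    return "likely benign"             # deleterious by none
```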

Samples were genotyped using an Illumina Global Screening Array (GSA) v.2 beadchip according to the manufacturer's recommendations. A total of 146,068 samples were made available for processing, of which 145,266 (99.5%) were successfully processed. The average genotype call rate per sample was 98.4%, and 98.4% of samples had a call rate above 90%. Of the 145,266 samples that were genotyped, 4,435 (3.1%) did not pass one or more of our QC metrics and were subsequently excluded. Reasons for exclusion were as follows: disagreement between genetically determined and reported sex (n=1,827); low-quality samples (call rates below 90%) (n=2,276); genotyping chip variants discordant with exome data (n=44); genetically identified sample duplicates (n=1,063 total samples); uncertain linkage back to a study participant (n=268); and sample affected by an instrument issue at DNA extraction (n=6). The remaining 140,831 samples were then used to compile a PVCF for downstream analysis. This dataset contained 650,380 polymorphic variants.

The input array data from the RGC Sequencing Laboratory consisted of 140,831 samples and 650,380 variants and were passed through the following QC steps: checks for consistency of genotypes in sex chromosomes (steps 1–4); sample-level and variant-level missingness filters (steps 5 and 6); the HWE exact test applied to a set of 81,747 third-degree unrelated samples, which were identified from the initial relatedness analysis using Plink and PRIMUS (step 7); setting genotypes with Mendelian errors in nuclear families to missing (step 8); and a second round of steps 5–7 (step 9). Plink commands associated with each step are displayed in column 2 (Supplementary Table 9). The final post-QC array data consisted of 138,511 samples and 559,923 variants.

We used Shapeit (v.4.1.3; https://odelaneau.github.io/shapeit4) to phase the array dataset of 138,511 samples and 539,315 autosomal variants that passed the array QC procedure. To improve phasing quality, we leveraged the inferred family information by building a partial haplotype scaffold from the unphased genotypes of 1,266 trios drawn from the 3,475 inferred nuclear families (randomly selecting one offspring per family when there was more than one). We then ran Shapeit one chromosome at a time, passing the scaffold information with the --scaffold option.

We separately phased the support-vector-machine-filtered WES and WGS datasets onto the array scaffold. The phased WGS data constitute the MCPS10k reference panel. For the WGS phasing, we used WhatsHap (https://github.com/whatshap/whatshap) to extract phase information in the sequence reads and from the subset of available trios and pedigrees, and this information was fed into Shapeit (v.4.2.2; https://odelaneau.github.io/shapeit4) through the --use-PS 0.0001 option. Phasing was carried out in chunks of 10,000 and 100,000 variants (WES and WGS, respectively) and using 500 SNPs from the array data as a buffer at the beginning and end of each chunk. The use of the phased scaffold of array variants meant that chunks of phased sequencing data could be concatenated together to produce whole chromosome files that preserved the chromosome-wide phasing of array variants. A consequence of this process is that when a variant appeared in both the array and sequencing datasets, the data from the array dataset were used.

To assess the performance of the WGS phasing process, we repeated the phasing of chromosome 2 after removing the children of the 200 mother–father–child trios. We then compared the phase of the trio parents to that in the phased dataset that included the children. We observed a mean switch error rate of 0.0024. Without using WhatsHap to leverage phase information in sequencing reads, the mean switch error rate increased to 0.0040 (Supplementary Fig. 23).
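The switch error rate used above can be sketched for a single sample as follows (a minimal illustration on toy haplotypes, not the study's evaluation code):

```python
def switch_error_rate(truth, inferred):
    """Switch error rate between two phasings of one diploid sample.

    Each phasing is a pair of equal-length haplotype allele lists.
    The rate is the fraction of consecutive heterozygous-site pairs at
    which the inferred phase orientation flips relative to the truth.
    """
    # Phase is only defined at heterozygous sites.
    het = [i for i in range(len(truth[0])) if truth[0][i] != truth[1][i]]
    switches = 0
    for a, b in zip(het, het[1:]):
        same_a = inferred[0][a] == truth[0][a]
        same_b = inferred[0][b] == truth[0][b]
        if same_a != same_b:  # orientation flipped between adjacent het sites
            switches += 1
    return switches / max(len(het) - 1, 1)
```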

The relatedness-inference criteria and relationship assignments were based on kinship coefficients and the probability of zero IBD sharing from the KING software (https://www.kingrelatedness.com). We reconstructed all first-degree family networks using PRIMUS (v.1.9.0; https://primus.gs.washington.edu/primusweb) applied to the IBD-based KING estimates of relatedness, along with the genetically derived sex and reported age of each individual. In total, 99.3% of the first-degree family networks were unambiguously reconstructed. To visualize the relationship structure in the MCPS, we used the software Graphviz (https://graphviz.org) to construct networks such as those presented in Supplementary Fig. 5. We used the sfdp layout engine, which uses a spring model with a force-directed approach to minimize edge length.

To identify IBD segments and to measure ROH, we ran hap-ibd (v.1.0; https://github.com/browning-lab/hap-ibd) using the phased array dataset of 138,511 samples and 538,614 sites from autosomal loci. Hap-ibd was run with the parameter min-seed=4, which looks for IBD segments that are at least 4 cM long. We filtered out IBD segments in regions of the genome with fourfold more or fourfold less than the median coverage along each chromosome following the procedure in IBDkin (https://github.com/YingZhou001/IBDkin), and filtered out segments overlapping regions with fourfold less than the median SNP marker density (Supplementary Fig. 28). For the homozygosity analysis, we intersected the sample with the exome data to evaluate loss-of-function variants, which resulted in a sample of 138,200. We further overlaid the ROH segments with local ancestry estimates, and assigned ancestry where the ancestries were concordant between haplotypes and the posterior probability was >0.9, assigning ancestry to 99.8% of the ROH.

We used the workflow implemented in the R package bigsnpr (https://privefl.github.io/bigsnpr). In brief, pairwise kinship coefficients were estimated using Plink (v.2.0) and samples were pruned for first-degree and second-degree relatedness (kinship coefficient<0.0884) to obtain a set of unrelated individuals. LD clumping was performed with a default LD r2 threshold of 0.2, and regions with long-range LD were iteratively detected and removed using a procedure based on evaluating robust Mahalanobis distances of PC loadings. Sample outliers were detected using a procedure based on K-nearest neighbours. PC scores and loadings for the first 20 PCs were efficiently estimated using truncated singular value decomposition (SVD) of the scaled genotype matrix. After removal of variant and sample outliers, a final iteration of truncated SVD was performed to obtain the PCA model. The PC scores and loadings from this model were then used to project withheld samples, including related individuals, into the PC space defined by the model using the online augmentation, decomposition and Procrustes algorithm. For each PC analysis in this study, variants with MAF<0.01 were removed.
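The SVD-and-projection workflow can be sketched with plain NumPy; withheld samples are projected here by the loadings directly, a simplification of the online augmentation, decomposition and Procrustes projection used by bigsnpr, and the genotype matrix is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(200, 50)).astype(float)  # samples x variants
train, withheld = G[:150], G[150:]   # model-building vs withheld samples

# Centre and scale each variant using model-sample statistics.
mu, sd = train.mean(axis=0), train.std(axis=0)
sd[sd == 0] = 1.0  # guard against monomorphic variants
Z_train = (train - mu) / sd

# Truncated SVD of the scaled genotype matrix gives PC loadings/scores.
U, S, Vt = np.linalg.svd(Z_train, full_matrices=False)
k = 20
loadings = Vt[:k].T                  # variants x k
scores = Z_train @ loadings          # PC scores of model samples

# Project withheld (e.g. related) samples into the same PC space.
projected = ((withheld - mu) / sd) @ loadings
```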

Admixture (v.1.3.0; https://dalexander.github.io/admixture) was used to estimate ancestry proportions in a set of 3,964 reference samples representing African, European, East Asian and American ancestries from a dataset of merged genotypes. This included 765 samples of African ancestry from 1000 Genomes (n=661) and HGDP (n=104), 658 samples of European ancestry from 1000 Genomes (n=503) and HGDP (n=155), 727 samples of East Asian ancestry from 1000 Genomes (n=504) and HGDP (n=223), and 1,814 American samples, including 716 Indigenous Mexican samples from the MAIS study, 64 admixed Mexican American samples from MXL, 21 Maya and 13 Pima samples from HGDP, and 1,000 unrelated Mexican samples from the MCPS. Included SNPs were limited to variants present on the Illumina GSA v.2 genotyping array for which TOPMed-imputed variants in the MAIS study had information r2≥0.9 (m=199,247 SNPs). To select the optimum number of ancestry populations (K) to include in the admixture model, fivefold cross-validation was performed for each K from 4 to 25 using the --cv flag. To obtain ancestry proportion estimates in the remaining set of 137,511 MCPS samples, the population allele frequencies (P) estimated from the analysis of reference samples were fixed as parameters so that the remaining samples could be projected into the admixture model. Projection was performed for the K=4 model and for the K=18 model that produced the lowest cross-validation error, and point estimation was attained using the block relaxation algorithm.

The MAIS genotyping datasets were obtained from L. Orozco from the Instituto Nacional de Medicina Genómica. For 644 samples, genotyping was performed using an Affymetrix Human 6.0 array (n=599,727 variants). An additional 72 samples (11 ancestry populations) were genotyped using an Illumina Omni 2.5 array (n=2,397,901 variants). The set of 716 Indigenous samples represents 60 out of the 68 recognized ethnic populations in Mexico3. Per-chromosome VCFs for each genotyping array were uploaded to the TOPMed imputation server (https://imputation.biodatacatalyst.nhlbi.nih.gov) and imputed from a multi-ethnic reference panel of 97,256 whole genomes. Phasing and imputation were performed using the programs Eagle and Minimac, respectively. The observed coefficient of determination (r2) for the reference allele frequency between the reference panel and the genotyping array was 0.696 and 0.606 for the Affymetrix and Illumina arrays, respectively.

Physical positions of imputed variants were mapped from genome build GRCh37 to GRCh38 using the program LiftOver, and only variant positions included on the Illumina GSA v.2 were retained. After further filtering out variants with imputation information r2<0.9, the following QC steps were performed before merging of the MAIS Affymetrix and Illumina datasets: (1) removal of ambiguous variants (that is, A/T and C/G polymorphisms); (2) removal of duplicate variants; (3) identification and correction of allele flips; and (4) removal of variants with position mismatches. Merging was performed using the --bmerge command in Plink (v.1.9).
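Steps 1 and 3 of the merge QC can be sketched as simple allele-pair checks; this is an illustrative helper, not the Plink implementation:

```python
# Complement map for strand flips.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_ambiguous(ref, alt):
    """A/T and C/G SNPs are strand-ambiguous: a flip is indistinguishable
    from the original orientation, so such variants are removed."""
    return COMP[ref] == alt

def reconcile(ref1, alt1, ref2, alt2):
    """How dataset-2 alleles relate to dataset-1 alleles at one site:
    'match', 'flip' (strand flip), 'swap' (ref/alt swapped), or None
    (irreconcilable; variant would be dropped)."""
    if (ref1, alt1) == (ref2, alt2):
        return "match"
    if (ref1, alt1) == (COMP[ref2], COMP[alt2]):
        return "flip"
    if (ref1, alt1) == (alt2, ref2):
        return "swap"
    return None
```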

We used publicly available genotypes from the HGDP (n=929) and the 1000 Genomes project (n=2,504). To obtain a combined global reference dataset for downstream analyses of population structure, admixture and local ancestry, the HGDP and 1000 Genomes datasets were merged. The resulting merged public reference dataset was subsequently merged with the MAIS dataset and the MCPS genotyping array dataset. Each merge was performed using the --bmerge command in Plink (v.1.9; https://www.cog-genomics.org/plink) after removing ambiguous variants, removing duplicate variants, identifying and correcting allele flips, and removing variants with position mismatches. The combined global reference dataset comprised 199,247 variants and 142,660 samples.

To characterize genetic admixture within the MCPS cohort, we performed a seven-way LAI analysis with RFMix (v.2.0; https://github.com/slowkoni/rfmix) that included reference samples from the HGDP and 1000 Genomes studies, and Indigenous samples from the MAIS study. The merged genotyping dataset, combining samples from these studies with the 138,511 MCPS participants, included 204,626 autosomal variants and 5,363 chromosome X variants.

To identify reference samples with extensive admixture to exclude from LAI, we performed admixture analysis with the program TeraSTRUCTURE (https://github.com/StoreyLab/terastructure) on a merged genotyping dataset (n=3,274) that included African (AFR), European (EUR) and American (AMR) samples from the HGDP, 1000 Genomes and MAIS studies, and 1,000 randomly selected unrelated MCPS samples. Following the recommended workflow in the TeraSTRUCTURE documentation (https://github.com/StoreyLab/terastructure), we varied the rfreq parameter over the set {0.05, 0.10, 0.15, 0.20} of fractions of autosomal variants with K=4 and selected the value that maximized the validation likelihood (20% of autosomal variants; rfreq=45,365). We then varied the K parameter and ran it in triplicate to identify the value that attained a maximal average validation likelihood (K=18). Each of the estimated K ancestries was assigned to a global superpopulation (that is, AFR, EUR and AMR), and the cumulative K ancestry proportion was used as an ancestry score for selecting reference samples. Using an ancestry score threshold of 0.9, 666 AFR, 659 EUR and 616 AMR samples were selected as reference samples. The AMR samples used for seven-way LAI comprised 98 Mexico_North, 42 Mexico_Northwest, 185 Mexico_Central, 128 Mexico_South and 163 Mexico_Southeast individuals.

Reference samples were phased using Shapeit (v.4.1.2; https://odelaneau.github.io/shapeit4) with default settings, and the phasing of the 138,511 MCPS participants was performed as described above (see the section Array phasing). Seven-way LAI was performed using RFMix (v.2.0), with the number of terminal nodes for the random forest classifier set to 5 (-n 5), the average number of generations since expected admixture set to 15 (-G 15), and ten rounds of expectation maximization (EM) algorithm (-e 10). Global ancestry proportion estimates were derived by taking the average per-chromosome Q estimates (weighted by chromosome length) for each of the seven ancestries (that is, AFR, EUR, Mexico_North, Mexico_Northwest, Mexico_Central, Mexico_South and Mexico_Southeast). Inferred three-way global ancestry proportion estimates were obtained by combining proportions for each of the five Indigenous Mexican populations into a single AMR category.

To delineate local ancestry segments for use in the estimation of ancestry-specific allele frequencies (see the section Ancestry-specific allele frequency estimation), we performed a three-way LAI analysis using a merged genotyping dataset that excluded the MAIS samples, as this afforded greater genotyping density (493,036 autosomal variants and 12,798 chromosome X variants). Before LAI analysis, reference samples were selected using the same workflow for TeraSTRUCTURE as described above, with the modifications being the inclusion of 10,000 unrelated MCPS participants and an ancestry threshold of 0.95. RFMix was applied as described above, with the modifications being the use of 753 AFR, 649 EUR and 91 AMR reference samples, the specification of 5 rounds of EM (-e 5), and the use of the --reanalyze-reference option, which treated reference haplotypes as if they were query haplotypes and updated the set of reference haplotypes in each EM round.

To measure the correlation in ancestry between partner pairs, we used a linear model to predict ancestry of each partner using the ancestry of their spouse, education level (four categories) and district (Coyoacn and Iztapalapa) of both partners.

We averaged local ancestry dosages (estimated using RFMix at 98,012 positions along the genome) from 78,833 unrelated MCPS samples and performed a per-ancestry scan testing for deviation of the local ancestry proportion from the global ancestry proportion19. The test is based on assumptions of binomial sampling and a normal approximation for the sample mean. The global ancestry proportion for each ancestry was estimated as a robust average over local ancestry using Tukey's biweight robust mean. The scan was performed on all autosomes separately for African, European and Indigenous Mexican ancestries with the significance threshold 1.7×10⁻⁷=0.05/(98,012×3), which accounts for the number of local ancestry proportions tested and the three ancestries.
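The deviation test can be sketched as a normal-approximation Z-statistic under binomial sampling of 2n allele copies; the function and data are illustrative, not the study's implementation:

```python
import math

def ancestry_deviation_z(local_dosages, global_prop):
    """Z-statistic testing whether the mean local ancestry dosage at one
    genomic position deviates from the genome-wide (global) proportion.

    local_dosages: per-sample diploid ancestry dosages in [0, 2] at one
    position; global_prop: the genome-wide ancestry proportion.
    """
    n = len(local_dosages)
    local_prop = sum(local_dosages) / (2 * n)  # local ancestry proportion
    # Binomial sampling variance of a proportion over 2n allele copies.
    se = math.sqrt(global_prop * (1 - global_prop) / (2 * n))
    return (local_prop - global_prop) / se
```

With 98,012 positions and three ancestries, a Bonferroni-corrected significance level of 0.05/(98,012×3) ≈ 1.7×10⁻⁷ on the corresponding P value matches the threshold stated above.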

IBD segments from hap-ibd were summed across pairs of individuals to create a network of IBD sharing represented by the weight matrix \(W\in {{\mathbb{R}}}_{\ge 0}^{n\times n}\) for n samples. Each entry \({w}_{ij}\in W\) gives the total length in cM of the genome that individuals i and j share identical by descent. We sought to create a low-dimensional visualization of the IBD network. We used a similar approach to that described in ref. 14, which used the eigenvectors of the normalized graph Laplacian as coordinates for a low-dimensional embedding of the IBD network. Let D be the degree matrix of the graph with \({d}_{ii}=\sum _{j}{w}_{ij}\) and 0 elsewhere. The normalized (random walk) graph Laplacian is defined to be \(L=I-{D}^{-1}W\), where I is the identity matrix.

The matrix L is positive semi-definite, with eigenvalues \(0={\lambda }_{0}\le {\lambda }_{1}\le \cdots \le {\lambda }_{n-1}\). The multiplicity of eigenvalue 0 is determined by the number of connected components in the IBD network. If the network is fully connected, the eigenvector associated with eigenvalue 0 is constant, whereas the remaining eigenvectors can be used to compute a low-dimensional representation of the IBD network. If p is the desired dimension, and \({u}_{1},\ldots ,{u}_{p}\) are the bottom p non-trivial eigenvectors of L (indexed from 0), the matrix \(U\in {{\mathbb{R}}}^{n\times p}\) with columns \({u}_{1},\ldots ,{u}_{p}\) defines a low-dimensional representation of each individual in the IBD network39. In practice, we solved the generalized eigenvalue problem \(Wu={\lambda }^{{\prime} }Du\) to obtain \({u}_{1},\ldots ,{u}_{p}\).

If u is an eigenvector of L with eigenvalue \(\lambda \), then u solves the generalized eigenvalue problem with eigenvalue \({\lambda }^{{\prime} }=1-\lambda \).

To apply this to the IBD network of the MCPS cohort, we first removed edges with weight >72 cM, as previously done14. We did this to avoid the influence of extended families on the visualization. We next extracted the largest connected component from the IBD network and computed the bottom eigenvectors \({u}_{1},\ldots ,{u}_{20}\) of the normalized graph Laplacian.
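The embedding step can be sketched on a toy weight matrix: solve the generalized eigenproblem Wu = λ′Du and keep the eigenvectors with the largest λ′ after discarding the trivial constant one (these are the bottom non-trivial eigenvectors of L). The matrix below is illustrative:

```python
import numpy as np
from scipy.linalg import eigh

# Toy symmetric IBD-sharing weight matrix (total cM shared) for five
# samples forming two clusters, {0, 1, 2} and {3, 4}, weakly linked.
W = np.array([
    [0., 10., 8., 1., 0.],
    [10., 0., 9., 0., 1.],
    [8., 9., 0., 1., 0.],
    [1., 0., 1., 0., 12.],
    [0., 1., 0., 12., 0.],
])
D = np.diag(W.sum(axis=1))  # degree matrix

# Generalized eigenproblem W u = lambda' D u. The trivial constant
# eigenvector has lambda' = 1 (eigenvalue 0 of L); the next-largest
# lambda' values give the embedding coordinates.
vals, vecs = eigh(W, D)
order = np.argsort(vals)[::-1]      # sort lambda' descending
embedding = vecs[:, order[1:3]]     # 2-D embedding, skipping trivial vector
```

On this toy network, the first embedding coordinate separates the two clusters, which is the behaviour the visualization relies on.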

To examine fine-scale population structure using haplotype sharing, we calculated a haplotype copying matrix L using Impute5 (https://jmarchini.org/software/#impute-5), with entries \({L}_{ij}\) that are the length of sequence that individual i copies from individual j. Impute5 uses a scalable imputation method that can handle very large haplotype reference panels. At its core is an efficient hidden Markov model that can estimate the local haplotype sharing profile of a target haplotype with respect to a reference set of haplotypes. To avoid the costly computation of using all the reference haplotypes, an approach based on the PBWT data structure was used to identify a subset of reference haplotypes that led to negligible loss of accuracy. We leveraged this methodology to calculate the copying matrix L, using array haplotypes from a set of 58,329 unrelated individuals as both target and reference datasets, and used the --ohapcopy and --ban-repeated-sample-names flags to prevent each target haplotype from copying itself. SVD on a scaled, centred matrix was performed using the bigstatsr package (https://cran.r-project.org/web/packages/bigstatsr/index.html) to generate 20 PCs. This is equivalent to an eigendecomposition of the variance–covariance matrix of recipients' shared segment lengths.

We imputed the filtered array dataset using both the MCPS10k reference panel and the TOPMed imputation server. For TOPMed imputation, we used Plink2 to convert this dataset from Plink1.9 format genotypes to unphased VCF genotypes. For compatibility with TOPMed imputation server restrictions, we split the samples in this dataset into six randomly assigned subsets of about 23,471 samples, and into chromosome-specific bgzipped VCF files. Using the NIH Biocatalyst API (https://imputation.biodatacatalyst.nhlbi.nih.gov), we submitted these six jobs to the TOPMed imputation server. Following completion of all jobs, we used bcftools merge to join the resulting dosage VCFs spanning all samples. For the MCPS10k imputation, we used Impute5 (v.1.1.5). Each chromosome was split into chunks using the imp5Chunker program with a minimum window size of 5Mb and a minimum buffer size of 500kb. Information scores were calculated using qctool (https://www.well.ox.ac.uk/~gav/qctool_v2/).

The 1000 Genomes WGS genotype VCF files were downloaded (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/) and filtered to remove sites that are multi-allelic, duplicated, have missingness >2%, have Hardy–Weinberg P<1×10⁻⁸ in any subpopulation or have MAF<0.1% in any subpopulation. We used only the 490 AMR samples in the MXL, CLM, PUR and PEL subpopulations. We constructed two subsets of genotypes on chromosome 2 from the Illumina HumanOmniExpressExome (8.v1-2) and Illumina GSA (v.2) arrays, and these were used as input to the TOPMed and MCPS10k imputation pipelines.

We measured imputation accuracy by comparing the imputed dosage genotypes to the true (masked) genotypes at variants not on the arrays. Markers were binned according to the MAF of the marker in 490 AMR samples. In each bin, we report the squared correlation (r2) between the concatenated vector of all the true (masked) genotypes at markers and the vector of all imputed dosages at the same markers. Variants that had a missing rate of 100% in the WGS dataset before phasing were removed from the imputation assessment.
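The accuracy metric can be sketched as follows: pool all masked genotypes and imputed dosages within each MAF bin, then compute their squared correlation. Array shapes and bin edges are illustrative:

```python
import numpy as np

def r2_by_maf_bin(true_gt, dosages, maf, bins):
    """Squared correlation (r2) between true (masked) genotypes and
    imputed dosages, concatenated across all markers within each MAF bin.

    true_gt, dosages: arrays of shape (markers, samples);
    maf: per-marker minor allele frequency; bins: list of (lo, hi) edges.
    """
    out = {}
    for lo, hi in bins:
        keep = (maf >= lo) & (maf < hi)
        if not keep.any():
            continue  # skip empty bins
        t = true_gt[keep].ravel()  # concatenate genotypes across markers
        d = dosages[keep].ravel()  # concatenate dosages the same way
        r = np.corrcoef(t, d)[0, 1]
        out[(lo, hi)] = r * r
    return out
```

Markers that were entirely missing in the pre-phasing WGS data would be excluded before this computation, as noted above.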

The LAI results consist of segments of inferred ancestry across each haplotype of the phased array dataset. As the WES and WGS alleles were phased onto the phased array scaffold, we inferred the ancestry of each exome allele using interpolation from the ancestry of the flanking array sites. For each WES and WGS variant on each phased haplotype, we determined the RFMix ancestry probability estimates at the two flanking array sites and used their relative base-pair positions to linearly interpolate their ancestry probabilities. For a given site, if \({p}_{ijk}\) is the probability that the jth allele of the ith individual is from population k, and \({G}_{ij}\) is the 0/1 indicator of the non-reference allele for the jth allele of the ith individual, then the weighted allele count \({{\rm{AC}}}_{k}\), the weighted allele number \({{\rm{AN}}}_{k}\) and the allele frequency \({\theta }_{k}\) of the kth population are given by

$$\mathrm{AC}_k=\sum_{i=1}^{n}\sum_{j=1}^{2}p_{ijk}G_{ij},\qquad \mathrm{AN}_k=\sum_{i=1}^{n}\sum_{j=1}^{2}p_{ijk},\qquad \theta_k=\frac{\mathrm{AC}_k}{\mathrm{AN}_k}$$

An estimate of the effective sample size for population k at the site is \(n_k=\mathrm{AN}_k/2\). Singleton sites can be hard to phase using existing methods. Family information and phase information in sequencing reads were used in the WGS phasing, which helped to phase a proportion of the singleton sites. In the WES dataset, we found that 46% of exome singletons occurred in stretches of heterozygous ancestry. For these variants, we gave equal weight to the two ancestries when estimating allele frequencies.
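The weighted counts ACk, ANk and frequencies θk defined above reduce to two nested sums, which can be written compactly with NumPy. This is a sketch with hypothetical array inputs, not the study's implementation.

```python
import numpy as np

def ancestry_weighted_frequencies(p, G):
    """Ancestry-weighted allele frequencies.

    p: (n, 2, K) array of per-allele ancestry probabilities p_ijk.
    G: (n, 2) array of 0/1 non-reference allele indicators G_ij.
    Returns (AC, AN, theta), each a length-K vector, per the formulas above.
    """
    p = np.asarray(p, dtype=float)
    G = np.asarray(G, dtype=float)
    AC = np.einsum('ijk,ij->k', p, G)   # sum_i sum_j p_ijk * G_ij
    AN = p.sum(axis=(0, 1))             # sum_i sum_j p_ijk
    theta = AC / AN
    return AC, AN, theta
```

With fully confident LAI calls (each p vector a 0/1 indicator), this reduces to ordinary per-population allele counting; uncertain ancestry instead spreads each allele fractionally across populations.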

To validate the MCPS allele frequencies, we downloaded the gnomAD v.3.1 reference dataset (https://gnomad.broadinstitute.org) and retained only high-quality variants annotated as having passed QC (FILTER=PASS), SNVs, outside low-complexity regions and with the number of called samples greater than 50% of the total sample size (n = 76,156). We additionally overlapped gnomAD variants with TOPMed Freeze 8 high-quality variants (FILTER=PASS) (https://bravo.sph.umich.edu/freeze8/hg38). We then merged gnomAD variants and MCPS exome variants by chromosome, position, reference allele and alternative allele, and excluded MCPS singletons that were heterozygous in ancestry. This process resulted in 2,249,986 overlapping variants available for comparison with the MCPS WES data. Median sample sizes in the gnomAD non-Finnish European, African/Admixed African and Admixed American populations were 34,014, 20,719 and 7,639, respectively.
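Matching variants across call sets on the (chromosome, position, REF, ALT) key, as in the merge above, can be sketched as a simple dictionary lookup. The function and tuple layout here are hypothetical, not the study's code.

```python
def merge_variant_tables(mcps_variants, gnomad_variants):
    """Overlap two variant call sets by (chromosome, position, REF, ALT).

    Each input is an iterable of (chrom, pos, ref, alt, af) tuples.
    Returns a list of (key, af_mcps, af_gnomad) for shared variants,
    ready for an allele-frequency comparison.
    """
    # Index one call set by its variant key for O(1) lookups
    gnomad_index = {(c, p, r, a): af for c, p, r, a, af in gnomad_variants}
    overlap = []
    for c, p, r, a, af in mcps_variants:
        key = (c, p, r, a)
        if key in gnomad_index:
            overlap.append((key, af, gnomad_index[key]))
    return overlap
```

Keying on all four fields ensures that multi-allelic sites split into separate ALT records are matched allele-by-allele rather than by position alone.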

To investigate the effect of relatedness on allele frequency estimates, we implemented a method to compute relatedness-corrected allele frequencies using identical-by-descent (IBD) segments. This method computes allele frequencies at a locus by clustering alleles inherited IBD from a common ancestor, then counting alleles once per common ancestor rather than once per sample. Because IBD sharing is affected by both demography and relatedness, we limited IBD sharing to segments between third-degree relatives or closer. Conceptually, this is equivalent to tracing the genealogy of a locus back in time across all samples until no third-degree relatives remain, then computing allele frequencies in the ancestral sample.

We estimated allele frequencies in two steps. First, we constructed a graph based on IBD sharing at a locus. Second, we estimated allele counts and allele numbers by counting the connected components of the IBD graph. Our approach is similar to the DASH haplotype clustering approach40. However, we make different assumptions about how errors affect the IBD graph and additionally compute ancestry-specific frequencies using local ancestry inference estimates.

To construct the IBD graph, suppose we have genotyped and phased N diploid samples at L biallelic loci. For each locus l we construct an undirected graph Gl = (Vl, El) describing IBD sharing among haplotypes. Let the tuple (i, j)l represent haplotype j of sample i at locus l, and let \(h^{(i,j)_l}\in\{0,1\}\) be the allele itself. Define

$$\begin{array}{l}V_l=\{(i,j)_l : 1\le j\le 2\ \text{and}\ 1\le i\le N\}\\ E_l=\{((i,j)_l,(s,t)_l) : h^{(i,j)_l}\ \text{and}\ h^{(s,t)_l}\ \text{are IBD}\}.\end{array}$$

In words, the set of vertices Vl comprises all haplotypes at locus l. Each edge in El joins a pair of haplotypes that fall on the same IBD segment (Supplementary Fig. 25).

If IBD segments are observed without error, then each maximal clique of Gl represents a set of haplotypes descended from a common ancestor. In practice, edges will be missing owing to errors in IBD calling. Thus, what we observe are sets of connected components rather than maximal cliques. Because we limited edges to pairs of third-degree relatives or closer, we assumed missing edges in connected components are false negatives and included them. We additionally removed edges between haplotypes for which the observed alleles conflicted.

Given an IBD graph Gl = (Vl, El) for a locus l, we estimated alternative allele counts and allele numbers by counting the connected components of the graph. Let Cl1, …, Clm be the connected components of Gl. Let CALT = {Clm: haplotypes in Clm have the ALT allele} and CREF = {Clm: haplotypes in Clm have the REF allele}.

Then

$$\begin{array}{l}\mathrm{AC}=|C_{\mathrm{ALT}}|\\ \mathrm{AN}=|C_{\mathrm{ALT}}|+|C_{\mathrm{REF}}|\\ \mathrm{AF}=\mathrm{AC}/\mathrm{AN}\end{array}$$
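Counting alleles once per connected component can be implemented with a union–find (disjoint-set) structure over haplotypes. The sketch below is illustrative, with hypothetical input names; it is not the study's implementation.

```python
def ibd_adjusted_frequency(alleles, ibd_pairs):
    """IBD-corrected allele frequency at one locus.

    alleles: dict mapping a haplotype id to its 0/1 allele.
    ibd_pairs: iterable of (hap_a, hap_b) pairs sharing an IBD segment here
        (restricted to third-degree relatives or closer, as in the text).
    Alleles are counted once per connected component of the IBD graph
    rather than once per haplotype.
    """
    parent = {h: h for h in alleles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ibd_pairs:
        # Drop edges between haplotypes whose observed alleles conflict,
        # mirroring the edge-removal step described above
        if alleles[a] != alleles[b]:
            continue
        parent[find(a)] = find(b)

    # One representative allele per connected component
    component_allele = {}
    for h, g in alleles.items():
        component_allele[find(h)] = g
    ac = sum(component_allele.values())
    an = len(component_allele)
    return ac, an, ac / an
```

Haplotypes with no qualifying IBD edges form singleton components, so unrelated samples are counted exactly as in a standard allele-frequency calculation.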

We additionally used LAI estimates to compute ancestry-specific frequencies. Let \(p^{(i,j)_l}\in\mathbb{R}^K\) be the vector of probabilities that an allele on haplotype j from sample i at locus l comes from one of K populations. For each connected component, we averaged local ancestry estimates

$$\bar{p}_{C_{lm}}=\frac{1}{|C_{lm}|}\sum_{(i,j)_l\in C_{lm}}p^{(i,j)_l}$$

We computed a vector of weighted allele counts W and allele numbers N by

$$\begin{array}{l}W=\sum_{C\in C_{\mathrm{ALT}}}\bar{p}_C\\ N=\sum_{C\in C_{\mathrm{ALT}}}\bar{p}_C+\sum_{C\in C_{\mathrm{REF}}}\bar{p}_C\end{array}$$

Ancestry-specific frequencies were estimated by dividing each component of W by the corresponding component of N.
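The component-averaging and weighted-count steps above translate directly into NumPy. This is a minimal sketch with hypothetical inputs (lists of per-component LAI arrays), not the study's code.

```python
import numpy as np

def ancestry_specific_frequency(components_alt, components_ref):
    """Ancestry-specific allele frequencies from component-averaged LAI.

    components_alt / components_ref: non-empty lists of (m, K) arrays, one
    per connected component, holding the K-vector ancestry probabilities of
    each of the component's m haplotypes.
    Returns the length-K vector W / N defined in the text.
    """
    # p-bar: average the LAI vectors within each connected component
    pbar_alt = [np.asarray(c, dtype=float).mean(axis=0) for c in components_alt]
    pbar_ref = [np.asarray(c, dtype=float).mean(axis=0) for c in components_ref]
    W = np.sum(pbar_alt, axis=0)        # weighted ALT allele counts
    N = W + np.sum(pbar_ref, axis=0)    # weighted allele numbers
    return W / N
```

Each component contributes a single averaged ancestry vector, so a large IBD cluster carrying the ALT allele counts once, weighted by its ancestry, rather than once per carrier.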

For singletons for which the phasing of haplotypes was unknown, we averaged the local ancestry estimates from the two haplotypes of the carrier sample.

To generate source datasets for assessing trans-ancestry portability of a BMI PRS, whole-genome regression was performed using Regenie (https://rgcgithub.github.io/regenie/) in individuals in the MCPS and in a predominantly European-ancestry cohort from the UK Biobank. Individuals with type 2 diabetes (ICD-10 code E11 or self-reported) were excluded. BMI values underwent rank-based inverse normal transformation (RINT) by sex and ancestry; models were additionally adjusted for age, age² and technical covariates (UK Biobank). The Regenie summary statistics from the UK Biobank were used to generate a BMI PRS in the MCPS; conversely, MCPS summary statistics were used to generate a BMI PRS in the UK Biobank.

To avoid overfitting with respect to the selection of a PRS algorithm and its associated tuning parameters, LDpred (https://github.com/bvilhjal/ldpred) with its tuning parameter (the assumed fraction of causal variants) set to 1 was chosen, following a recent publication on BMI and obesity27. Summary statistics were restricted to HapMap3 variants and followed existing filtering recommendations. In the MCPS, two PRS values were generated, with imputed variants obtained from either the MCPS10k reference panel or the TOPMed panel. In the UK Biobank data, PRS values were calculated separately by continental ancestry (African, East Asian, European, Latino, South Asian), determined from a likelihood-based inference approach8 in a merged dataset of variants from the UK Biobank and the 1000 Genomes Project.

To evaluate PRS performance, BMI values were transformed (RINT) by sex and ancestry and regressed on PRS, age and age². As for the generation of summary statistics, individuals with diabetes were excluded from the analysis. PRS accuracy was assessed by incremental R² (the proportional reduction in the regression sum of squared errors between models with and without the BMI PRS). Additionally, raw BMI values were modelled with PRS, age, age², sex and ancestry to obtain effect-size estimates per standard deviation of the BMI PRS. The impact of ancestry differences between the source summary statistics and the target PRS was assessed with two approaches. For the MCPS, individuals were divided into quantiles of estimated Indigenous Mexican ancestry using the LAI approach described above. For the UK Biobank, metrics were calculated within each 1000 Genomes-based continental ancestry.
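Incremental R² as described above (full model minus covariate-only model) can be computed with ordinary least squares. This is a hedged sketch with hypothetical names, not the study's evaluation code.

```python
import numpy as np

def incremental_r2(y, prs, covariates):
    """Incremental R^2 of a PRS: R^2 of the full model (covariates + PRS)
    minus R^2 of the covariate-only model.

    y: phenotype vector (for example, RINT-transformed BMI).
    prs: polygenic score vector.
    covariates: (n, c) array, for example columns for age and age^2.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    intercept = np.ones((n, 1))
    X_base = np.hstack([intercept, np.asarray(covariates, dtype=float)])
    X_full = np.hstack([X_base, np.asarray(prs, dtype=float).reshape(-1, 1)])

    def r2(X):
        # Ordinary least squares fit; R^2 = 1 - SS_res / SS_tot
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        return 1 - ss_res / ss_tot

    return r2(X_full) - r2(X_base)
```

Because the base covariates appear in both models, the difference isolates the variance explained by the PRS beyond age and age² alone.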

The MCPS represents a long-standing scientific collaboration between researchers at the National Autonomous University of Mexico and the University of Oxford, who jointly established the study in the mid-1990s and have worked together on it ever since. Blood sample collection and processing were funded by a Wellcome Trust grant to the Mexican and Oxford investigators. However, at the time, no funding was requested to create an appropriate long-term sample storage facility in Mexico City. Therefore, the Mexican investigators agreed for the samples to be shipped to Oxford where they could be stored in a liquid-nitrogen sample storage archive (funded by the UK Medical Research Council and Cancer Research UK) that had previously been established by the Oxford team, and only on the understanding that control of the samples remained with the Mexican investigators. The shipping of blood samples from Mexico to the United Kingdom was approved by the Mexican Ministry of Health, and the study was approved by scientific and ethics committees within the Mexican National Council of Science and Technology (0595 P-M), the Mexican Ministry of Health and the Central Oxford Research Ethics Committee (C99.260). Although appropriate facilities in Mexico City now exist to store the samples, the Mexican investigators have decided that the costs of sending them back to Mexico exceed the benefits of having closer access to them. Study participants gave signed consent in keeping with accepted ethical practices at the time for observational cohort studies. The baseline consent form stated that their blood samples would be stored and used in the future for unspecified research purposes (with a specific statement that this would include future analysis of genetic factors) and that it would probably be many years before such blood analyses were done. 
The MCPS consent form also stated that the research was being done in collaboration with the University of Oxford and that the purpose of the study was to benefit future generations of Mexican adults. In 2019, the Mexican and Oxford investigators jointly agreed to allow the extracted DNA to be sent to the Regeneron Genetics Center, which had offered to genotype and exome sequence the entire cohort (thereby creating the resource now available for future research by Mexican scientists; see the Data Availability section), in exchange for sharing the other data with them for the purpose of performing joint collaborative genetic analyses. Formal approval to share MCPS data with commercial institutions was sought and obtained from the Medical Ethics Committee of the National Autonomous University of Mexico (FMED/CEI/MHU/001/2020). Major discoveries from the study have been disseminated through open-access scientific publications, local and international scientific meetings, press releases, social media and local television, but direct communication of study results to the original study participants is unfortunately not practical as no information on telephone numbers or email addresses was collected at recruitment. As in other prospective cohort studies (such as the UK Biobank), it was agreed that there would be no feedback of individual blood results to participants, as it has been shown that such feedback can do more harm than good.

Recruitment of individuals in the MAIS cohort was done with the approval of the leaders of the Indigenous communities and with the support of the National Commission for the Development of Indigenous Communities of Mexico (CDI), now the Instituto Nacional de los Pueblos Indígenas (INPI). All participants provided written informed consent, and authorities or community leaders participated as translators where necessary. The consent form described how findings from the study may have commercial value and be used by for-profit companies. Sample collection for MAIS was approved by the Bioethics and Research Committees of the Instituto Nacional de Medicina Genómica in Mexico City (protocol numbers 31/2011/I and 12/2018/I). Preliminary data from the MAIS cohort have been discussed with the Indigenous leaders and volunteers included in the study, explaining the meaning of the findings for health or population history and the potential use of the data in future collaborations.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
