Genetic diversity and ancestry of the Khmuic-speaking ethnic groups … – Nature.com

Posted: September 21, 2023 at 10:16 am

Ethical statement

Ethical approval for this study was granted by the Human Experimentation Committee of the Research Institute for Health Sciences, Chiang Mai University, Thailand (Certificate of Ethical Clearance No. 31/2022). Throughout the research, we protected the rights and identities of participants, and we confirm that all experiments were performed in accordance with relevant guidelines and regulations based on the experimental protocol on human subjects under the Declaration of Helsinki. Written informed consent was obtained from all volunteers prior to the interview and sample collection.

A total of 95 unrelated subjects residing in five villages of Nan province, Thailand, were enrolled with written informed consent. Volunteers were healthy subjects over 20 years old, of Khmuic-speaking ethnicity, with no known ancestors from other recognized ethnic groups for at least three generations. We collected personal data using form-based oral interviews covering self-reported unrelated lineages, linguistics, and migration histories. We collected buccal or saliva samples and extracted DNA using the Gentra Puregene Buccal Cell Kit (Qiagen, Germany), following the manufacturer's instructions.

Genotyping was carried out using the Affymetrix Axiom Genome-Wide Human Origins array [10]. Primary screening with Affymetrix Genotyping Console v4.2 produced a total of 93 samples genotyped for 622,834 loci on hg19 human reference genome coordinates (genotype call rate ≥ 97%). We used PLINK version 1.90b5.2 [24] to exclude loci and individuals with more than 5% missing data, and we also excluded mtDNA and the sex chromosomes from our analysis. We further excluded loci that did not pass the Hardy-Weinberg equilibrium test (P value < 0.00005) or had more than 5% missing data within any population. We used KING 2.3 [25] to determine individual relatedness and removed one person from each pair with first-degree kinship. After these quality control measures, 81 Khmuic-speaking individuals (Fig. 1) and 612,614 loci remained.
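The quality control filters above map onto standard PLINK 1.9 flags. The following is a minimal sketch that only assembles the argument list; the file prefixes are hypothetical, and it applies the Hardy-Weinberg filter once to the pooled data, whereas the text describes filtering within each population, so it is a simplification rather than the authors' exact procedure.

```python
def plink_qc_command(in_prefix, out_prefix, max_missing="0.05", hwe_p="0.00005"):
    """Build a PLINK 1.9 argument list for basic genotype quality control.

    in_prefix and out_prefix are hypothetical .bed/.bim/.fam prefixes.
    """
    return [
        "plink",
        "--bfile", in_prefix,
        "--autosome",           # keep chromosomes 1-22 only (drops X, Y, MT)
        "--mind", max_missing,  # drop individuals with > 5% missing calls
        "--geno", max_missing,  # drop loci with > 5% missing calls
        "--hwe", hwe_p,         # drop loci failing the Hardy-Weinberg test
        "--make-bed",
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(plink_qc_command("khmuic_raw", "khmuic_qc")))
```

The list can be passed to `subprocess.run` once PLINK is installed; relatedness filtering with KING is a separate step and is not shown.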

We next used PLINK version 1.90b5.2 to merge our newly obtained genotyping results with a set of genome-wide SNP data [8], which included populations from East/Southeast Asia and South Asia, African Mbuti, European French, and Southeast Asian ancient samples [9,10,11,12,13]. Note that in this collection, allelic data from ancient samples were obtained using pseudo-haploid techniques, and samples with fewer than 15,000 informative loci were removed. After restricting to SNP positions that could be jointly analyzed within this dataset, we excluded SNPs that had more than 5% missing data, had a minor allele frequency (MAF) less than 3.3 × 10⁻⁴, or were not in Hardy-Weinberg equilibrium at a significance level of P < 0.00005. As a result, 353,505 positions in a dataset of 979 individuals from 90 populations (Supplementary Tables 1 and 2) were used for subsequent analysis.
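To make the missingness and MAF criteria concrete, here is a small pure-Python sketch that applies both to one SNP at a time. The genotype encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call) and the function names are assumptions made for illustration; the thresholds are supplied by the caller, and the values in the example are toy values.

```python
def site_stats(genotypes):
    """Return (missing_rate, minor_allele_frequency) for one SNP.

    genotypes: alternate-allele counts (0, 1 or 2) per diploid individual,
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1 - alt_freq)


def keep_site(genotypes, max_missing, min_maf):
    """True if the SNP passes both the missingness and the MAF threshold."""
    missing_rate, maf = site_stats(genotypes)
    return missing_rate <= max_missing and maf >= min_maf


if __name__ == "__main__":
    toy_snp = [0, 1, 2, 0, None]  # one missing call out of five
    print(site_stats(toy_snp))
    print(keep_site(toy_snp, max_missing=0.25, min_maf=0.01))
```

In practice these filters would be run with PLINK on the merged dataset; the sketch only shows what each threshold measures.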

To investigate the genetic structure and relationships of the analyzed samples, we used PLINK version 1.90b5.2 to prune for linkage disequilibrium, excluding one variant from each pair with r² > 0.4 within windows of 200 variants and a step size of 25 variants. A total of 959 individuals from the sample set, excluding the Mbuti and French populations, were included, and 149,384 SNP positions were available for this analysis. Principal Component Analysis (PCA) was carried out using smartpca from EIGENSOFT with the "lsqproject" and "autoshrink" options.
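The pruning and PCA steps can be sketched as two PLINK calls followed by a smartpca parameter file. All file names below are hypothetical, and the parameter file assumes the data have already been converted to EIGENSTRAT format; the population list passed as `poplistname` is assumed to hold the populations used to compute the axes, with the remaining samples projected onto them.

```python
def ld_prune_commands(in_prefix, out_prefix, window=200, step=25, r2=0.4):
    """Return two PLINK 1.9 argument lists: mark variants in LD, then extract."""
    mark = ["plink", "--bfile", in_prefix,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", out_prefix]
    # --indep-pairwise writes the retained variants to <out_prefix>.prune.in
    extract = ["plink", "--bfile", in_prefix,
               "--extract", out_prefix + ".prune.in",
               "--make-bed", "--out", out_prefix + "_pruned"]
    return mark, extract


def smartpca_parfile(prefix, poplist_path):
    """Return the text of a smartpca parameter file (hypothetical paths)."""
    lines = [
        f"genotypename: {prefix}.geno",
        f"snpname: {prefix}.snp",
        f"indivname: {prefix}.ind",
        f"evecoutname: {prefix}.evec",
        f"evaloutname: {prefix}.eval",
        f"poplistname: {poplist_path}",  # populations that define the axes
        "lsqproject: YES",               # least-squares projection of the rest
        "autoshrink: YES",               # shrinkage option named in the text
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for cmd in ld_prune_commands("merged_qc", "merged_ld"):
        print(" ".join(cmd))
    print(smartpca_parfile("merged_ld_pruned", "pca_populations.txt"))
```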

To infer population structure, we ran ADMIXTURE on 155,709 SNP positions from a sample set of 979 individuals, which encompassed both Asian samples and the outgroups represented by the Mbuti and French populations. The clustering tool ADMIXTURE version 1.3.0 [14] was run from K=2 to K=10 with 100 replicates for each K and using random seeds with the -P option. For each K, the top 20 ADMIXTURE replicates with the highest likelihood for the major mode were displayed using PONG version 1.4.7 [26]. For these PCA and ADMIXTURE analyses, the ancient samples and highly drifted modern populations (Mlabri, Onge, Mamanwa, Khamu, and Lua) were projected.
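The replicate scheme (K=2 to K=10, 100 replicates per K, each with its own random seed) can be laid out as a list of commands. This is a sketch under assumptions: the input file name and the master seed are hypothetical, and projection of the held-out samples with -P is not shown because it requires precomputed allele frequency files.

```python
import random


def admixture_runs(bed_file, k_min=2, k_max=10, replicates=100, master_seed=1):
    """Return one dict per ADMIXTURE run with K, replicate index, seed, command."""
    k_values = list(range(k_min, k_max + 1))
    rng = random.Random(master_seed)
    # Draw distinct seeds so that no two replicates start from the same state.
    seeds = rng.sample(range(1, 10**6), len(k_values) * replicates)
    runs = []
    for i, k in enumerate(k_values):
        for rep in range(replicates):
            seed = seeds[i * replicates + rep]
            runs.append({
                "K": k,
                "replicate": rep,
                "seed": seed,
                "command": ["admixture", "-s", str(seed), bed_file, str(k)],
            })
    return runs


if __name__ == "__main__":
    runs = admixture_runs("merged_ld_pruned.bed")
    print(len(runs), "runs;", " ".join(runs[0]["command"]))
```

ADMIXTURE names its output after the input file and K, so each replicate would need its own working directory to avoid overwriting earlier results.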

To test admixture and excess ancestry sharing, we used admixr version 0.7.1 [27] with ADMIXTOOLS version 5.1 [10] to calculate the f3- and f4-statistics, with significance assessed through block jackknife resampling across the genome and using Mbuti as the outgroup. A total of 353,505 SNPs from 979 samples were used in these analyses. Additional f4-statistics were computed when ancient samples were involved, using French as the outgroup to avoid deep attraction to Africans and only transversions (2,947–51,452 SNPs, depending on the quality of samples) to avoid potential noise from ancient DNA damage patterns [28]. We used the pheatmap package in R version 3.6.0 to visualize the f3 and f4 profiles as heatmaps.
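As an illustration of how an f4-statistic and its jackknife standard error are obtained, the sketch below computes f4(A, B; C, D) as the mean of (a − b)(c − d) over SNPs and estimates the standard error by deleting one contiguous block of SNPs at a time. It is a simplified stand-in, not the ADMIXTOOLS implementation: blocks here hold roughly equal numbers of SNPs and are unweighted, and the input allele frequencies are assumed to be given in genome order.

```python
import math


def f4_per_snp(a, b, c, d):
    """Per-SNP terms (a - b) * (c - d) from allele frequencies of four groups."""
    return [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]


def block_jackknife(values, n_blocks):
    """Return (mean, standard error) using a delete-one-block jackknife."""
    n = len(values)
    if not 2 <= n_blocks <= n:
        raise ValueError("need 2 <= n_blocks <= number of SNPs")
    total = sum(values)
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for start, end in zip(bounds, bounds[1:]):
        block = values[start:end]
        leave_one_out.append((total - sum(block)) / (n - len(block)))
    centre = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - centre) ** 2 for x in leave_one_out)
    return total / n, math.sqrt(variance)


def z_score(estimate, std_err):
    """Z score of the estimate; undefined (nan) when the error is zero."""
    return estimate / std_err if std_err > 0 else float("nan")


if __name__ == "__main__":
    # Toy allele frequencies for four populations at six SNPs.
    a = [0.10, 0.40, 0.55, 0.80, 0.30, 0.65]
    b = [0.20, 0.35, 0.50, 0.70, 0.45, 0.60]
    c = [0.15, 0.50, 0.60, 0.90, 0.25, 0.70]
    d = [0.30, 0.45, 0.40, 0.75, 0.35, 0.55]
    est, se = block_jackknife(f4_per_snp(a, b, c, d), n_blocks=3)
    print(est, se, z_score(est, se))
```

A |Z| well above the usual threshold of about 3 is typically read as a significant deviation from zero.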

To examine haplotype sharing between different groups, we used SHAPEIT version 4.1.3 [29] to phase the modern samples. We used South Asian and East Asian populations as a reference panel (excluding the Kinh Vietnamese), together with the recombination map from the 1000 Genomes Phase 3 [30]. This analysis focused specifically on modern population data, consisting of 359,539 SNPs. To prepare the reference panel, we used bcftools version 1.4 to extract individuals of East and South Asian descent, as well as the sites overlapping with our data, for each chromosome from the 1000 Genomes Phase 3 data. The phasing accuracy of SHAPEIT4 can be improved by increasing the number of conditioning neighbors in the Positional Burrows-Wheeler Transform (PBWT) on which haplotype estimation is based [29]. We conducted phasing with the option --pbwt-depth 8 for 8 conditioning neighbors, keeping other parameters at their defaults.

Subsequently, we ran ChromoPainter version 2 [31] on the phased dataset to investigate haplotype sharing, with the sample size of each population randomly down-sampled to 4 and to 8. The former set was used for 10 iterations of the EM (expectation-maximization) process to estimate the switch rate and global mutation probability. The latter was used for the chromosome painting itself with the estimated switch and global mutation rates, and its output was used for downstream analyses. We painted the chromosomes of each individual with all the modern Asian samples serving as both donors and recipients via the -a argument. The EM estimation yielded a switch rate of approximately 251.21 and a global mutation probability of approximately 0.00001, which were subsequently used as starting values for these parameters for all donors in the painting process. Heatmaps of the results were generated using the pheatmap package in R.
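The random down-sampling to 4 and 8 individuals per population can be sketched as follows. The population labels and sample IDs are made up, the seeds are arbitrary, and keeping a population whole when it has fewer individuals than requested is an assumption of this sketch rather than something stated in the text.

```python
import random


def downsample(populations, n_per_pop, seed=0):
    """Randomly keep at most n_per_pop individuals from each population.

    populations: dict mapping a population label to a list of sample IDs.
    """
    rng = random.Random(seed)
    return {
        pop: sorted(rng.sample(ids, min(n_per_pop, len(ids))))
        for pop, ids in populations.items()
    }


if __name__ == "__main__":
    toy = {
        "PopA": [f"A{i}" for i in range(1, 13)],
        "PopB": [f"B{i}" for i in range(1, 7)],
    }
    em_subset = downsample(toy, 4, seed=1)     # smaller set for EM estimation
    paint_subset = downsample(toy, 8, seed=2)  # larger set for the painting
    print(em_subset)
    print(paint_subset)
```

Fixing the seed makes the subsets reproducible, which helps when the EM estimates and the final painting are run at different times.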

To construct the admixture graph, we first selected backbone populations from different language families in Southeast Asia. Specifically, we used f4-statistics to choose representative ethnic groups speaking Austronesian, Tai-Kadai, Austroasiatic, Hmong-Mien, and Sino-Tibetan languages: Atayal, Dai, Cambodian, Miao, and Naxi, respectively. We used the African Mbuti and North Indian populations that speak Indo-European languages (Gujarati, Brahmin Tiwari, and Lodhi) as outgroups. Our focus was on constructing the admixture graph for the Austroasiatic language family in Thailand, so we categorized these populations according to their linguistic branches: Katuic (Bru and Soa), Monic (Mon), Palaungic (Lawa_Eastern, Lawa_Western, Palaung, Blang), and Mlabri. The Khmuic-speaking groups of interest were divided into the Khamu (consisting of four Khamu populations) and Lua (consisting of two Lua populations together with HtinMal and HtinPray).

For modeling the admixture graph, we used a dataset of 359,539 SNPs from modern populations as the input for ADMIXTOOLS 2 [32]. Initially, we computed pairwise f2 statistics between the groups using the extract_f2 function with the following parameters: maxmiss=0 (no missing SNPs allowed in the calculation), useallsnp: NO (no missing data allowed), and blg=0.05 (SNP block size set to 0.05 Morgans). Then, we extracted allele frequency products from the computed f2 blocks using f2_from_precomp. Next, for each scenario, we searched for the best-fitting admixture graph by running ten independent runs of find_graphs. From the 100 independent runs, we selected the one with the lowest score (computed based on residuals between the expected and observed f-statistics given the data) using random_admixturegraph. To confirm the fit, we tested the graph with the lowest score using qpgraph with parameters numstart=100, diag=0.0001, and return_fstats=TRUE, which allowed us to check whether the absolute value of the worst-fitting Z score was below 3. Starting with no migrations (numadmix=0), we gradually added migrations until we found a fitting graph, which we considered the best-fitting graph for that particular scenario.
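The stepwise search described above (start with no migrations, keep the lowest-scoring graph among independent runs, accept it once the worst |Z| falls below 3, otherwise add a migration) is plain control flow and can be sketched independently of ADMIXTOOLS 2, which is an R package. In this sketch `fit` is a hypothetical callable standing in for one graph search plus its qpgraph check, and the number of runs per step and the cap on migrations are parameters chosen by the caller.

```python
def search_best_graph(fit, max_admix, n_runs, z_cutoff=3.0):
    """Add admixture events one at a time until a graph fits.

    fit(numadmix, seed) is a hypothetical callable returning a dict with
    keys "score" (lower is better) and "worst_z" (worst residual Z score).
    Returns (numadmix, best_result), or None if nothing fits up to max_admix.
    """
    for numadmix in range(max_admix + 1):
        results = [fit(numadmix, seed) for seed in range(n_runs)]
        best = min(results, key=lambda r: r["score"])
        if abs(best["worst_z"]) < z_cutoff:
            return numadmix, best
    return None


def toy_fit(numadmix, seed):
    """Stand-in for a real fit: more admixture events give a better toy fit."""
    return {
        "score": 100.0 / (numadmix + 1) + seed,
        "worst_z": 6.0 - 2.0 * numadmix,
        "seed": seed,
    }


if __name__ == "__main__":
    print(search_best_graph(toy_fit, max_admix=5, n_runs=10))
```

Stopping at the first fitting graph favors the model with the fewest migrations, which matches the way the search is described.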
