Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes

by isenews, 19 October 2025

Overview of the genotyping method

We represent variation as haplotype segments that are short enough to minimize disruption by recombination, allowing precise sharing with an NGS sample through identity by descent²³ while capturing structural information including phased small variation, structural variation and gene conversion events. We label these haplotype segments as pangenome-derived alleles (PAs) and detect PAs shared with an NGS sample. PA boundaries are arranged to study variation of protein-coding genes. Each PA includes consecutive exons separated by <20 kb and 5 kb of flanking sequences (Methods), reflecting functional proximity of short-range transcription factors²⁴ and population-level genomic linkage. PAs typically range between 10 and 100 kb, corresponding to the scale of linkage disequilibrium (LD) blocks²⁵. While PAs generally correspond to individual genes, they also cover fractions of genes with long introns or, conversely, include tandemly arrayed paralogs within 20 kb.
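
The boundary rule described above (consecutive exons separated by <20 kb are grouped, and 5 kb of flanking sequence is added) can be sketched roughly as follows. This is a minimal illustration only: the function name `build_pa_intervals`, the interval representation and the toy coordinates are assumptions, and the actual procedure is the one given in Methods.

```python
def build_pa_intervals(exons, max_gap=20_000, flank=5_000):
    """Group exon intervals whose gaps are < max_gap, then pad each group by flank.

    exons: iterable of (start, end) coordinates on one chromosome (hypothetical input).
    """
    groups = []
    for start, end in sorted(exons):
        if groups and start - groups[-1][1] < max_gap:
            # Close enough to the previous exon: extend the current group.
            groups[-1][1] = max(groups[-1][1], end)
        else:
            # Gap of max_gap or more: start a new group.
            groups.append([start, end])
    # Add flanking sequence on both sides, clamped at the chromosome start.
    return [(max(0, s - flank), e + flank) for s, e in groups]


# Toy coordinates, not taken from any real gene.
toy_exons = [(100_000, 100_500), (110_000, 110_300), (150_000, 150_400)]
pa_intervals = build_pa_intervals(toy_exons)
```

In this toy case the first two exons fall into one interval and the third, which lies more than 20 kb downstream, forms its own. This mirrors how a gene with a long intron can be split across several PAs.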

For computational efficiency and to avoid alignment ambiguity in repetitive DNA, we use an alignment-free comparison of low-copy k-mers (DNA fragments of a fixed length k; k = 31) measured in NGS samples to genotype PAs. For each gene, we group all similar PAs in the pangenome, including orthologs, paralogs and homologous pseudogenes, and construct a matrix used in genotyping that contains the k-mer composition of all grouped PAs (Methods). The rows of a matrix correspond to individual PAs, columns correspond to k-mers exclusive to the grouped PAs, and cell values represent the k-mer multiplicity in each PA (Methods and Fig. 1a,b).
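
To make the matrix layout concrete, the sketch below builds a toy k-mer matrix with PAs as rows, k-mers as columns and multiplicities as cell values. It is an illustration under simplifying assumptions. The helper names and the `background` argument (standing in for k-mers that also occur outside the grouped PAs) are hypothetical. Reverse complements and the low-copy filter are ignored, and a small k is used only so the example stays readable.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_matrix(pa_seqs, k=31, background=frozenset()):
    """Return (row names, column k-mers, matrix of k-mer multiplicities).

    pa_seqs: dict mapping a PA name to its sequence (hypothetical input).
    background: k-mers seen outside the group, dropped from the columns.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in pa_seqs.items()}
    columns = sorted({km for c in counts.values() for km in c} - set(background))
    rows = sorted(pa_seqs)
    # A Counter returns 0 for k-mers that a PA does not contain.
    matrix = [[counts[name][km] for km in columns] for name in rows]
    return rows, columns, matrix


# Toy sequences that differ at a single base.
toy_pas = {"PA_a": "ACGTACGA", "PA_b": "ACGTTCGA"}
rows, columns, matrix = build_kmer_matrix(toy_pas, k=4)
```

Here the shared k-mer `ACGT` has a count of 1 in both rows, while k-mers spanning the differing base appear in only one row. Those row-specific columns are what allow similar PAs to be told apart.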

**Fig. 1: Overview of the genotyping method.**

a, Demography of the reference pangenome assemblies. Map credit: Hogweard/Wikimedia Commons. b, Construction of pangenome k-mer matrices for CNV genes. Each individual gene is represented as a vector of counts of k-mers exclusively found among homologous sequences. All similar sequences including paralogs and orthologs are included and integrated as a k-mer matrix. c, Construction of phylogenetic trees based on k-mer matrices. d, Schematic of approach to estimate genotypes of alleles using NGS data. The k-mers from each matrix are counted in NGS data and normalized by sequencing depth. The normalized k-mer counts are projected to all pangenome genes. e, Reprojection of the raw results in the last step to integer solutions recursively based on the phylogenetic tree. f, An illustrative annotation and genotyping results on the SMN1 and SMN2 genes using HPRC samples. On the left side of the classification, the phylogenetic tree and heatmap of pairwise similarities are shown along with a mutation plot based on an MSA highlighting point differences in SMN1 in CHM13. All SMN genes are categorized into five major types and 17 subgroups. SMN1 and SMN2 correspond to the most common types of each paralog; SMN1-2, SMN1 partially converted to SMN2; SMN-conv, converted SMN genes, predominantly mapped to the SMN2 locus and found to be enriched in African populations; SMN2-2, a rare outgroup of SMN2. The GRCh38 assembly includes SMN1-2 and SMN2. The Phe280 T variant that disrupts the splicing of SMN2 transcripts is highlighted in red. The genotyping results of 1kGP continental populations are shown on the right. Rows correspond to subgroups, columns correspond to continental populations, and the colors of pie charts give distributions of copy numbers (CNs) among each continental population. AFR, African; AMR, American; EAS, East Asian; EUR, European; SAS, South Asian; ref, reference.

The genotyping is performed per matrix by identifying a combination of PAs (rows) and their copy number with the least-squared distance between their k-mer counts and that from an NGS sample. The sample k-mer counts are projected into the vector space of each k-mer matrix and assigned integer copy numbers using recursive rounding based on the phylogeny of PA sequences (Methods and Fig. 1c–e), resulting in a list of PA-specific copy numbers (paCNs). For example, there are 178 PAs for SMN genes, the gene family associated with spinal muscular atrophy. This includes copies of SMN1, SMN2 and paralogs with gene conversion²⁶, for example, paralogs mapped to SMN2 that contain the SMN1 version of Phe280, the single nucleotide polymorphism (SNP) responsible for dysfunctional exon 7 splicing of SMN2 (ref. ²⁷) (Fig. 1f).
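
The least-squares objective can be illustrated on a toy matrix. Note that the sketch below does not implement the projection and phylogeny-guided recursive rounding described above. It swaps in an exhaustive search over small integer copy numbers, which is only feasible for a handful of rows. The function name, the toy matrix and the toy sample counts are all assumptions made for illustration.

```python
from itertools import product


def genotype_by_exhaustive_search(matrix, sample_counts, max_cn=2):
    """Find integer copy numbers per row that minimize the squared distance
    between the summed k-mer counts and the normalized sample counts."""
    best_cn, best_dist = None, float("inf")
    for cn in product(range(max_cn + 1), repeat=len(matrix)):
        # Expected count of each k-mer given this combination of copy numbers.
        predicted = [sum(c * row[j] for c, row in zip(cn, matrix))
                     for j in range(len(sample_counts))]
        dist = sum((p - s) ** 2 for p, s in zip(predicted, sample_counts))
        if dist < best_dist:
            best_cn, best_dist = cn, dist
    return best_cn, best_dist


# Rows are three hypothetical PAs; columns are four k-mers.
toy_matrix = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Depth-normalized k-mer counts from a hypothetical NGS sample.
toy_sample = [2.1, 2.9, 0.9, 0.1]
pa_cns, dist = genotype_by_exhaustive_search(toy_matrix, toy_sample)
```

For this toy input the best-fitting combination is two copies of the first PA, one of the second and none of the third. The search space grows exponentially with the number of rows, which is why a projection followed by rounding is needed for matrices with hundreds of PAs.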

PA database construction

We constructed a PA database for 3,351 genes previously reported as CNVs¹⁸,²⁰ (Supplementary Table 1), using 114 diploid PacBio HiFi assemblies from the Human Pangenome Reference Consortium (HPRC), the Human Genome Structural Variation Consortium (HGSVC), the Chinese Pangenome Consortium (CPC) and two telomere-to-telomere assemblies²⁸,²⁹, in addition to GRCh38 and CHM13 (ref. ³⁰). In total, we defined 1,408,209 PAs, organized into 3,307 matrices (Fig. 2a–c).

**Fig. 2: Overview of the database of PAs.**

Because of limited human genetic diversity and stronger LD across short distances, PAs are often highly similar or identical. To reduce dimensionality and facilitate cohort analysis, we used their phylogenetic relationships to merge PAs into groups of highly similar sequences (subgroups), which were treated as equal states (Methods). In total, we defined 89,236 subgroups, which were used to enumerate all PAs, analogous to human leukocyte antigen (HLA) nomenclature (Supplementary Fig. 1).

To annotate low-frequency variants and reference genome locations for orthologous or paralogous relationships, we mapped PAs to GRCh38 (Supplementary Notes). In total, 164,237 paralogous PAs across 6,389 loci were determined. Paralogous PAs that were similar to their corresponding reference locus (≥80% k-mer similarity) were labeled duplicative, and the remaining lower-identity paralogous PAs were labeled diverged. In total, 10,792 diverged paralogs from 2,734 subgroups were identified across 333 matrices (Fig. 2d). The divergent paralogs represent new sequences recalcitrant to canonical reference analysis. For example, some amylase PAs include paralogs for both AMY1 and AMY2B, reflecting an AMY2B translocation (Fig. 2a).
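
The duplicative versus diverged labeling can be sketched as a simple threshold on k-mer similarity. The exact similarity measure is defined in the Supplementary Notes and is not reproduced here. As an assumption, the sketch uses the fraction of a PA's distinct k-mers that also occur in the reference locus. The function names and toy sequences are hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def label_paralog(pa_seq, ref_seq, k=31, threshold=0.80):
    """Label a paralogous PA by its k-mer similarity to a reference locus.

    Similarity here is an assumed stand-in: shared distinct k-mers divided
    by the number of distinct k-mers in the PA.
    """
    pa_kmers = kmer_set(pa_seq, k)
    if not pa_kmers:
        raise ValueError("sequence is shorter than k")
    similarity = len(pa_kmers & kmer_set(ref_seq, k)) / len(pa_kmers)
    label = "duplicative" if similarity >= threshold else "diverged"
    return label, similarity


# Toy sequences with a small k, for readability only.
toy_ref = "ACGTACGTAC"
same_label, same_sim = label_paralog("ACGTACGTAC", toy_ref, k=4)
other_label, other_sim = label_paralog("ACGTTTTTTT", toy_ref, k=4)
```

With these toy inputs the identical sequence is labeled duplicative and the second sequence, which shares only one of its four distinct k-mers with the reference, is labeled diverged.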

While most duplications were distal to their original genes, 6,673 PAs reflected proximal (<20 kb) duplications, including 1,646 PAs across 36 genes exhibiting ‘runaway duplication’ (ref. ³¹) with at least three proximal duplications (Supplementary Fig. 2). Proximally duplicated genes were included in the same PA as their ortholog as a heritable unit. Orthologous PAs were classified as reference alleles if they belonged to the same subgroup as the reference gene and as alternative alleles otherwise (Fig. 2d). All PAs were genotyped regardless of paralog–ortholog annotation so that the resulting genotypes contain population and copy number variation.

Ctyper databases capture population diversity

We assessed whether PAs capture unique aspects of genomic information that cannot be replicated by other CNV representations, including copy numbers of reference genes¹,³¹, singly unique nucleotide k-mers¹ (SUNKs) and large haplotype structures³²,³³. We found that PAs provide higher resolution of variation (for example, single-nucleotide variants), as 94.7% of variants are not reflected by sequences in GRCh38. Additionally, both nearby SUNK markers (Fig. 2e) and large haplotype structures were found to be poor proxies for PAs, and only a small proportion of PAs were found to link to SUNKs or larger haplotypes (Methods). Despite largely reduced dimensions, subgroups capture more than 80% of the total population variation (Methods and Fig. 2f). Finally, using saturation analysis³⁴, we estimate that the current cohort represents 98.7% of subgroups in non-Africans and 94.9% in Africans (Methods), suggesting a near-saturated database (Fig. 2g).

Benchmarking genotypes from NGS samples

We genotyped 2,504 unrelated individuals and 641 offspring from the 1000 Genomes Project (1kGP). Most subgroups (99.25%) showed Hardy–Weinberg equilibrium (Methods and Fig. 3a) and thus little bias. There were 27 matrices with >15% subgroups in disequilibrium, which were mostly short genes (median = 4,564 bp) with few low-copy k-mers (Supplementary Table 2). Genotypes were accurate with an average F₁ score for trio concordance of 97.58% (Supplementary Notes, Supplementary Table 3 and Fig. 3b), while 18 matrices had high discordance (>15%), primarily for subtelomeric genes or on sex chromosomes with poorer assembly qualities (Supplementary Table 4).

**Fig. 3: Benchmarking of genotyping results.**

We assessed copy number accuracy and bias among highly duplicated gene families (for example, amylase, NBPF, GOLGA and TBC1D3). The copy numbers derived from genotyping were compared to those from corresponding assemblies for 39 HPRC samples shared with the 1kGP using a database inclusive of these samples. To limit compounded error from misassembled sequences, we excluded samples with low-confidence sequences (Methods). For each sample, we benchmarked on all matrices for which the corresponding assembly was high in copy number (>10). The copy numbers were highly correlated (ρ = 0.996, Pearson correlation) with little bias (Fig. 3c), 0.2% missing copies (false negatives) and 2.4% additional copies (false positives), likely from unassembled genes in assemblies. High concordance remained when tests were expanded to all genotyped genes (ρ ≥ 0.996, Pearson correlation).
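
A comparison of this kind can be sketched as follows. The copy numbers are made-up toy values, not results from the study. The denominators used for the missing and additional fractions (total assembly copies) are an assumption, since the exact definitions are in Methods.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def copy_number_errors(assembly_cn, genotyped_cn):
    """Fraction of assembly copies missed, and fraction called in excess."""
    total = sum(assembly_cn)
    missing = sum(max(0, a - g) for a, g in zip(assembly_cn, genotyped_cn))
    extra = sum(max(0, g - a) for a, g in zip(assembly_cn, genotyped_cn))
    return missing / total, extra / total


# Toy per-matrix copy numbers for one hypothetical sample.
toy_assembly = [12, 15, 20, 11]
toy_genotyped = [12, 16, 20, 11]
r = pearson(toy_assembly, toy_genotyped)
missing_frac, extra_frac = copy_number_errors(toy_assembly, toy_genotyped)
```

In the toy data one matrix has a single extra genotyped copy and none are missing. This is the kind of discrepancy that the text attributes to genes absent from the assembly.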

We assessed the sequence similarity of the genotyped alleles to the ground truth genome assembly for the 39 HPRC benchmarking genomes. Each sample was genotyped with the full database (full-set) or the database excluding its corresponding PAs (leave-one-out). We matched the genotyped PAs to the corresponding assembly PAs (Methods), excluding introns and decoys and sequences with <1 kb of nonrepetitive bases, and measured the similarity between the genotyped allele and the assigned query. We performed a similar analysis, treating the closest neighbor to each assembly PA from the database as the correctly genotyped locus. Due to mismatching from database sampling or misassemblies, 2.9% of PAs from the leave-one-out experiment and 1.0% from the full-set experiment were not paired with truth copies for assessment. For the full set, paired PAs had 0.36 mismatches per 10 kb, with 93.0% having no mismatches in nonrepetitive regions. The leave-one-out tests had 2.7 mismatches per 10 kb in nonrepetitive regions, which was 1.2 additional mismatches per 10 kb from the optimal solutions (closest neighbors); 57.3% of alleles had no mismatches, and 77.0% were mapped to the optimal solution (Fig. 3d). The leave-one-out results were 96.5% more similar to the original PAs than the closest GRCh38 gene at 79.3 mismatches per 10 kb.

To isolate sources of errors in cases of misassemblies, we directly compared leave-one-out genotyping results to a telomere-to-telomere assembly²⁸ of genic PAs. The sample genotypes had 11,627 correctly matched subgroups, 599 (4.8%) mistyped to other subgroups, 131 from subgroups unique to the assembly (1.1%; out of reference), 127 false positives (0.5% F₁) and 93 false negatives (0.4% F₁) for a total F₁ error of 6.7% (Methods and Fig. 3e), with a copy number agreement of 99.1%. This is a 3% increase in mistypes compared to trio discordance.

The computational requirements are low enough for biobank-scale analysis. The average runtime for genotyping 3,351 genes at 30× coverage was 80.2 min (1.0 min per 1× coverage for sample preprocessing and 0.9 s per gene for genotyping) on a single core (Fig. 3f) using ~20 GB of RAM, with support for parallel processing.
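
The reported runtime can be checked against its two components with a short calculation. The function name is hypothetical, and the per-unit costs are the rounded values quoted above, so the result only approximates the reported average.

```python
def estimated_runtime_minutes(coverage, n_genes,
                              preprocess_min_per_x=1.0,
                              genotype_sec_per_gene=0.9):
    """Estimate single-core runtime from the per-unit costs quoted in the text."""
    preprocessing = coverage * preprocess_min_per_x       # minutes
    genotyping = n_genes * genotype_sec_per_gene / 60.0   # seconds -> minutes
    return preprocessing + genotyping


total = estimated_runtime_minutes(coverage=30, n_genes=3351)
```

This gives about 30 min of preprocessing plus about 50.3 min of genotyping, roughly 80.3 min in total, in line with the reported average of 80.2 min given the rounding of the per-unit figures.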

We compared the HLA, KIR and CYP2D6 genotypes to the locus-specific methods T1K³⁵ and Aldy³⁶. For 31 HLA genes, ctyper had an F₁ score of 98.9% across all four fields of HLA nomenclature³⁷,³⁸ against the full-set analysis and a score of 86.3% for the leave-one-out analysis, while T1K had 70.8%. For protein-coding products (first two fields), ctyper reached 99.98% against the full-set analysis (with 99.9% copy number F₁ correctness) and 96.5% (with 99.5% copy number F₁ correctness) for the leave-one-out analysis, and T1K had 97.2% (Fig. 3g and Supplementary Tables 5 and 6). For 14 KIR genes, ctyper reached 98.5% across all fields in the full-set analysis and 70.6% for the leave-one-out analysis, while T1K had 32.0% due to the limited database. For protein-coding products (first three digits), ctyper reached 99.2% against the full-set analysis (with 99.9% copy number F₁ correctness) and 88.8% for the leave-one-out analysis (with 99.2% copy number F₁ correctness), while T1K had 79.6% (Supplementary Fig. 3). Benchmarking CYP2D6 star annotations of assemblies³⁹, ctyper reached 100.0% against the full-set analysis and 83.2% for the leave-one-out analysis, compared to 80.0% using Aldy (Fig. 3h). There was perfect agreement of SNP variants for ctyper against the full-set analysis and 95.7% for the leave-one-out analysis, compared to 85.2% using Aldy.

Finally, we used ctyper to genotype 273 challenging medically relevant (CMR) genes⁴⁰. Nonrepetitive regions averaged 0.29 mismatches per 10 kb against the full-set analysis, 99.7% fewer than when comparing assemblies to corresponding GRCh38 sequences (baseline). The genotypes using leave-one-out databases had 4.9 mismatches per 10 kb, 94.8% fewer than baseline (Supplementary Figs. 4–6). Including repeat-masked low-complexity sequences (for example, variable-number tandem repeats), there were 10.5 mismatches per 10 kb against the full-set analysis (97.6% fewer than baseline) and 74.7 mismatches per 10 kb for the leave-one-out analysis (82.7% fewer than baseline; Supplementary Figs. 7–9).

We compared genotyping of HLA and CMR genes to a contemporary method using pangenomes, Locityper⁴¹, using leave-one-out analysis. For HLA, Locityper achieved an F₁ score of 87.9% (versus ctyper, 86.3%) for predicting all four nomenclature fields, while ctyper performed slightly better on the first two fields for protein-coding variants (96.5% versus 94.0%; Supplementary Notes), although ctyper had a roughly 218× speedup due to alignment-free genotyping. When analyzing CMR genotypes, there were 19.8 fewer mismatches per 10 kb than the Locityper genotypes in comparable regions (Supplementary Notes, Supplementary Table 7 and Supplementary Fig. 10).

Sequence-level diversity of CNVs in global populations

We used principal-component (PC) analysis (PCA) to examine the population structure of PA genotypes in the 2,504 unrelated 1kGP samples, 879 Genotype–Tissue Expression (GTEx) samples and 105 diploid assemblies (excluding HGSVC due to quality filtering), excluding rare subgroups (<0.05 allele frequency) and limiting copy number to ten to balance the weights of PCs (Fig. 4a). All data cluster by population as opposed to source, suggesting little bias between genotyping and assembly or across NGS cohorts. The top 0.1% highest-weighted subgroups in PC1 have an average aggregate copy number (aggreCN) variance of 26.33, significantly larger than the overall of 4.00 (P value = 1.11 × 10⁻¹⁶, F-test). Similarly, PC2 and PC3 have mean aggreCN variances of 19.73 and 7.20, suggesting that CNVs are weakly associated with sequence variants. Furthermore, PC1 is the only PC that clustered all samples into the same sign with a geographic center away from 0, suggesting that it corresponds to modulus variance (hence aggreCN) if treating samples as vectors of paCNs. Meanwhile, PC2 and PC3 were similar to the PCA plots based on SNP data of global samples⁴², suggesting that they are associated with the sequence diversity of CNV genes. The total number of duplications is elevated in African populations (Fig. 4b), reflected in the order of PC1 (Fig. 4a).

**Fig. 4: Global population diversity in allele-specific copy number variation.**

We examined ctyper genotypes to measure the extent to which duplications show population specificity. We used the F statistic, a generalization of F_ST that accommodates more than two genotypes (Methods), to test the differences in distributions across continental populations (Fig. 4c). In total, 4.4% (223 of 5,065) of duplicated subgroups showed population specificity (F statistic > 0.2; Supplementary Table 8). The subgroups of PAs with the highest F statistic (0.48) contain duplications of HERC2P9, a known differentiated gene⁷. Additionally, a converted copy of SMN2 annotated as a duplication of SMN1 is enriched in African populations (F statistic = 0.43).

We then measured the divergence of duplicated genes from their reference copies, indicating recent or ancient duplications and providing a measure of reference bias from missing paralogs. We constructed multiple-sequence alignments (MSAs; Methods) for sequences of each matrix and measured all pairwise differences in nonrepetitive sequences. We determined the average paralog sequence divergence relative to the ortholog divergence (Methods), which we refer to as the relative paralog divergence (RPD). We also measured copy number diversity using mean absolute error (MAE), indicating the CNV level among populations (Fig. 4d). Based on RPD, using density-based spatial clustering of applications with noise⁴³, we identified two peaks at 0.71 and 3.2, with MAE centers at 0.18 and 0.93, corresponding to genes with rare and recent CNVs and more divergent and common CNVs, respectively. The latter reflect CNVs on different structural haplotypes that cannot be analyzed using a single reference genome. For example, AMY1A has a high RPD at 3.10 because of truncated duplications. These results are consistent with ancient bursts of duplications in human evolution⁴⁴.

We next used ctyper genotypes to investigate recombination at different CNV loci. We determined multiallelic LD⁴⁵ (mLD; Methods) between PAs using the unrelated 1kGP samples for 989 subgroups that were adjacent and less than 100 kb apart in GRCh38 and reported the average mLD within each matrix (Fig. 4e). There was a stronger negative rank correlation between MAE of copy number and mLD (ρ = −0.24, P value = 3.4 × 10⁻¹⁵, Spearman’s rank) than the rank correlation between mLDs and locus length (ρ = −0.21, P value = 1.5 × 10⁻¹¹, Spearman’s rank), suggesting a reduced haplotype linkage in genes with frequent CNVs. The lowest mLD (0.013) was found in FAM90, a gene with frequent duplications and rearrangements⁴⁶. The 29 loci with highest mLD (mLD > 0.7) were enriched in the sex chromosomes (n = 19). Furthermore, HLA-B and HLA-DRB had mLD > 0.7 and only deletion CNV (Supplementary Notes).

eQTL analysis

To investigate the impact of paCNs on expression, we performed eQTL analysis using the Genetic European Variation in Disease⁴⁷ (GEUVADIS) and GTEx⁴⁸ cohorts. There were 4,512 genes that could be uniquely mapped in RNA-seq alignments. An additional 44 genes, such as SMN1, SMN2, AMY1A, AMY1B and AMY1C, have indistinguishable transcription products and were analyzed by pooling among all copies. We assigned PAs to these transcripts based on exonic sequences and performed association analysis with paCNs (Methods and Supplementary Table 9).

After merging paCNs to aggreCNs, 5.5% (178 of 3,224) of transcripts showed significance (corrected P = 1.6 × 10⁻⁵, Pearson correlation) as previously observed³¹. By contrast, when updating aggreCNs by individual paCNs and performing multivariable linear regression on expression (Methods), there were significant improvements in fit for 27.6% (890 of 3,224) of transcripts (corrected P = 1.6 × 10⁻⁵, one-tailed F-test; Fig. 5a). To test whether the fit was explained by the nonuniform expression of different alleles of the same reference gene, we used a linear mixed model (LMM; Methods) to regress total expression to individual subgroups and estimate allele-specific expression and then compared these values to other subgroups of the same matrix that were assigned to the same reference gene (Supplementary Table 10). For subgroups within solvable matrices and more than ten samples, we found that 7.94% (150 of 1,890) of paralogs and 3.28% (546 of 16,628) of orthologs had significantly different expression levels (corrected with sample size = number of paralogs + orthologs, corrected P = 2.7 × 10⁻⁶, χ² test; Fig. 5b). Overall, paralogs were found to have reduced expression (Fig. 5c), consistent with previous findings for duplicated genes⁴⁹.
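
The relationship between the two models can be sketched in two small helpers. The first collapses paCNs into a single aggreCN by summation. The second computes the standard F statistic for comparing a reduced regression (for example, aggreCN only) with a fuller one (individual paCNs). The function names and the toy numbers are assumptions. The actual model specification and multiple-testing correction are those in Methods.

```python
def aggregate_copy_number(pa_cns):
    """Collapse PA-specific copy numbers (paCNs) into one aggregate copy number."""
    return sum(pa_cns.values())


def nested_f_statistic(rss_reduced, rss_full, p_reduced, p_full, n_samples):
    """F statistic for nested linear models.

    rss_*: residual sums of squares; p_*: number of fitted parameters
    (including the intercept); n_samples: number of observations.
    """
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n_samples - p_full)
    return numerator / denominator


# Toy genotype for one hypothetical sample.
toy_pa_cns = {"SMN1": 2, "SMN2": 1, "SMN-conv": 1}
toy_aggre_cn = aggregate_copy_number(toy_pa_cns)

# Toy residuals: the fuller model halves the residual sum of squares.
f_stat = nested_f_statistic(rss_reduced=10.0, rss_full=5.0,
                            p_reduced=2, p_full=4, n_samples=24)
```

A large F statistic indicates that splitting the aggregate copy number into allele-specific terms reduces the residual variance by more than the extra parameters alone would be expected to. A P value would then come from the F distribution with (p_full − p_reduced, n_samples − p_full) degrees of freedom.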

**Fig. 5: The impact of allele-specific copy number variation on gene expression.**

We compared expression in 57 tissues in the GTEx samples to test for preferential expression of paralogs (Methods and Supplementary Table 11). There was alternative tissue specificity for 132 of 2,820 paralogs (4.68%) and 225 of 19,197 orthologs (1.17%) (corrected P = 6.4 × 10⁻⁸, union of two χ² tests; Methods and Fig. 5d).

Additionally, we used analysis of variance (ANOVA) to estimate the proportion of expression variance (R²) explained by paCNs in GEUVADIS expression data and compared it to that in a model based on known SNPs, indels and eQTL structural variants (SVs)⁵⁰ (Methods). As expected, the highly granular paCNs explained the most variance: on average, 10.3% (14.3% including baseline). By contrast, 58.0% of transcripts are genes with known eQTL variants that explained valid variance by 2.14% (1.60% considering experimental noise, in agreement with a previous estimate of 1.97%⁵¹). On average, 1.98% of the variance was explained by aggreCNs, and 8.58% was explained by subgroup information. When combining both paCNs and known eQTL sites, 10.4% (19.0% including baseline) of the valid variance was explained (Fig. 5e).

We examined the SMN and AMY2B genes as case studies due to their importance in disease and evolution²⁷,⁵². The SMN genes were classified as SMN1, SMN2 and SMN-converted. We found no significant difference between the expression of all transcripts of SMN1 and SMN2 (0.281 ± 0.008 versus 0.309 ± 0.009; P = 0.078, χ² test). However, significant differences were found between SMN-converted and the combined SMN1 and SMN2 (0.226 ± 0.012 versus 0.294 ± 0.002; P = 1.75 × 10⁻⁷, χ² test), with a 23.0% reduction in SMN-converted expression. By contrast, despite having lower overall expression, SMN-converted had 5.93× the expression of valid exon 7 splicing⁵³ of SMN2 (P = 2.2 × 10⁻¹⁶, χ² test), indicating that SMN-converted has full functional splicing⁵⁴ but lower overall expression (Fig. 5f). We studied the expression of AMY2B duplications, including alleles translocated proximally to other AMY genes, such as the PAs containing AMY1 and AMY2B in Fig. 2a. Using probabilistic estimation of expression residuals (PEER)-corrected GTEx pancreas data, we found that translocated AMY2B genes had significantly higher expression than other duplications (1.384 ± 0.233 versus −0.275 ± 0.183, P = 7.87 × 10⁻⁹, χ² test) (Fig. 5g).
