DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets
2510.05777v1
cs.LG, cs.CR, q-bio.GN
2025-10-09
Авторы:
Shadi Rahimian, Mario Fritz
Abstract
Single nucleotide polymorphism (SNP) datasets are fundamental to genetic
studies but pose significant privacy risks when shared. The correlation of SNPs
with each other makes strong adversarial attacks such as masked-value
reconstruction, kin, and membership inference attacks possible. Existing
privacy-preserving approaches either apply differential privacy to statistical
summaries of these datasets or offer complex methods that require
post-processing and the usage of a publicly available dataset to suppress or
selectively share SNPs.
In this study, we introduce an innovative framework for generating synthetic
SNP sequence datasets using samples derived from time-inhomogeneous hidden
Markov models (TIHMMs). To preserve the privacy of the training data, we ensure
that each SNP sequence contributes only a bounded influence during training,
enabling strong differential privacy guarantees. Crucially, by operating on
full SNP sequences and bounding their gradient contributions, our method
directly addresses the privacy risks introduced by their inherent correlations.
Through experiments conducted on the real-world 1000 Genomes dataset, we
demonstrate the efficacy of our method using privacy budgets of $\varepsilon
\in [1, 10]$ at $\delta=10^{-4}$. Notably, by allowing the transition models of
the HMM to be dependent on the location in the sequence, we significantly
enhance performance, enabling the synthetic datasets to closely replicate the
statistical properties of non-private datasets. This framework facilitates the
private sharing of genomic data while offering researchers exceptional
flexibility and utility.
Ссылки и действия
Дополнительные ресурсы: