TY - JOUR
T1 - Bayesian nonparametric clustering and association studies for candidate SNP observations
AU - Wang, Charlotte
AU - Ruggeri, Fabrizio
AU - Hsiao, Chuhsing K.
AU - Argiento, Raffaele
PY - 2017
Y1 - 2017
N2 - Clustering is often the first step in the analysis of an
enormous amount of Single Nucleotide Polymorphism (SNP) genotype data. The lack of
biological information could affect the outcome of such a procedure. Even if a clustering
procedure has been selected and performed, the impact of its uncertainty on the
subsequent association analysis is rarely assessed. In this research we first propose a model
to cluster SNP data, then we assess the association between the clusters and a disease. In
particular, we adopt a Dirichlet process mixture model, which has two advantages over
the usual clustering methods: the number of clusters need not be known and
fixed in advance, and the variation in the assignment of SNPs to clusters can be accounted for.
In addition, once a clustering of SNPs is obtained, we design an individualized genetic score
quantifying the SNP composition in each cluster for every subject, so that we can set up
a generalized linear model for association analysis that incorporates the information
from a large-scale SNP dataset with a much smaller number of explanatory
variables. The inference on cluster allocation, the strength of association of each cluster
(the collective effect of the SNPs in the same cluster), and the susceptibility of each SNP
are based on posterior samples from Markov chain Monte Carlo methods and the Binder
loss information. We exemplify this Bayesian nonparametric strategy in a genome-wide
association study of Crohn’s disease in a case-control setting.
KW - Bayesian Clustering, Bayesian Nonparametric, Random partitions, Dirichlet process mixture model, GWAS, Logistic regression
UR - http://hdl.handle.net/10807/146797
UR - https://www.scopus.com/inward/record.uri?eid=2-s2.0-84982224096&doi=10.1016/j.ijar.2016.07.014&partnerid=40&md5=8dfd6a584e5479a80012c9da98a4a256
U2 - 10.1016/j.ijar.2016.07.014
DO - 10.1016/j.ijar.2016.07.014
M3 - Article
SN - 0888-613X
VL - 80
SP - 19
EP - 35
JO - International Journal of Approximate Reasoning
JF - International Journal of Approximate Reasoning
ER -