TY - JOUR
T1 - Model-Based Clustering of Categorical Data Based on the Hamming Distance
AU - Argiento, Raffaele
AU - Filippi-Mazzola, Edoardo
AU - Paci, Lucia
PY - 2024
Y1 - 2024
N2 - A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
KW - Bayesian clustering
KW - Conditional algorithm
KW - Dirichlet process
KW - Finite mixture models
KW - Markov chain Monte Carlo
UR - http://hdl.handle.net/10807/301459
U2 - 10.1080/01621459.2024.2402568
DO - 10.1080/01621459.2024.2402568
M3 - Article
SN - 0162-1459
SP - 1
EP - 23
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
ER -