TY - JOUR
T1 - Variational inference for semiparametric Bayesian novelty detection in large datasets
AU - Benedetti, Luca
AU - Boniardi, Eric
AU - Chiani, Leonardo
AU - Ghirri, Jacopo
AU - Mastropietro, Marta
AU - Cappozzo, Andrea
AU - Denti, Francesco
PY - 2023
Y1 - 2023
N2 - After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
AB - After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
KW - Bayesian modeling
KW - Dirichlet process
KW - Large datasets
KW - Nested mixtures
KW - Novelty detection
KW - Variational inference
UR - https://publicatt.unicatt.it/handle/10807/309183
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85178477392&origin=inward
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85178477392&origin=inward
U2 - 10.1007/s11634-023-00569-z
DO - 10.1007/s11634-023-00569-z
M3 - Article
SN - 1862-5347
SP - 1
EP - 23
JO - Advances in Data Analysis and Classification
JF - Advances in Data Analysis and Classification
IS - 18
ER -