TY - JOUR
T1 - Accounting for outliers in optimal subsampling methods
AU - Deldossi, Laura
AU - Pesce, Elena
AU - Tommasi, Chiara
PY - 2023
Y1 - 2023
N2 - Nowadays, in many different fields, massive data are available and, for several reasons, it might be convenient to analyze just a subset of the data. The application of the D-optimality criterion can be helpful to optimally select a subsample of observations. However, it is well known that D-optimal support points lie on the boundary of the design space and, if they go hand in hand with extreme response values, they can have a severe influence on the estimated linear model (leverage points with high influence). To overcome this problem, first, we propose a non-informative “exchange” procedure that enables us to select a “nearly” D-optimal subset of observations without high leverage values. Then, we provide an informative version of this exchange procedure, where, besides high leverage points, outliers in the responses (which are not necessarily associated with high leverage points) are also avoided. This is possible because, unlike other design situations, in subsampling from big datasets the response values may be available. Finally, both the non-informative and informative selection procedures are adapted to I-optimality, with the goal of getting accurate predictions.
KW - D-optimality
KW - I-optimality
KW - Active learning
KW - Subsampling
UR - https://publicatt.unicatt.it/handle/10807/233890
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85153753227&origin=inward
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85153753227&origin=inward
DO - 10.1007/s00362-023-01422-3
M3 - Article
SN - 0932-5026
SP - 1119
EP - 1135
JO - Statistical Papers
JF - Statistical Papers
VL - 64
ER -