Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery

Boley, Mario; Goldsmith, Bryan; Ghiringhelli, Luca M.; Vreeken, Jilles

doi:10.1007/s10618-017-0520-3

Journal Article

MPS-Authors

Boley, Mario
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Goldsmith, Bryan
Theory, Fritz Haber Institute, Max Planck Society;

Ghiringhelli, Luca M.
Theory, Fritz Haber Institute, Max Planck Society;

Vreeken, Jilles
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Fulltext (public)

s10618-017-0520-3.pdf
(Publisher version), 2MB

Citation

Boley, M., Goldsmith, B., Ghiringhelli, L. M., & Vreeken, J. (2017). Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery. Data Mining and Knowledge Discovery, 31(5), 1391-1418. doi:10.1007/s10618-017-0520-3.

Cite as: https://hdl.handle.net/11858/00-001M-0000-002D-99F7-B

Abstract

Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case when dispersion is measured using the mean absolute deviation from the median, this novel approach yields a linear time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.
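
To make the class of objective functions described in the abstract concrete, the following is a minimal sketch of one function that depends on subgroup size (non-decreasing), the subgroup median, and the mean absolute deviation from the median (non-increasing). The specific product form, the function names, and the toy data are assumptions made for illustration; the abstract does not state the exact objective, and the sketch does not implement the tight optimistic estimator or the branch-and-bound search.

```python
from statistics import median


def mean_abs_dev_from_median(values):
    """Mean absolute deviation of `values` around their median."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)


def dispersion_corrected_score(subgroup, population):
    """Hypothetical objective (an assumption, not taken from the abstract):
    coverage * relative dispersion reduction * median shift.

    Assumes the population has non-zero dispersion.
    """
    # Non-decreasing in subgroup size.
    coverage = len(subgroup) / len(population)
    # Non-increasing in the subgroup's dispersion around its median.
    pop_disp = mean_abs_dev_from_median(population)
    disp_term = max(0.0, 1.0 - mean_abs_dev_from_median(subgroup) / pop_disp)
    # Rewards subgroups whose median lies above the population median.
    shift = max(0.0, median(subgroup) - median(population))
    return coverage * disp_term * shift


# Toy target values, invented for illustration only.
population = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
tight = [10.0, 11.0, 12.0, 13.0]   # high median, low dispersion
loose = [1.0, 4.0, 12.0, 13.0]     # similar size, high dispersion

print(dispersion_corrected_score(tight, population))
print(dispersion_corrected_score(loose, population))
```

In this toy setting the tightly clustered subgroup receives a positive score, while the equally sized but widely spread subgroup scores zero, which is the kind of preference for consistent statements that the abstract motivates.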