Int J Epidemiol. 2019 Aug 1;48(4):1305-1315.
doi: 10.1093/ije/dyz012.

Outlier detection for questionnaire data in biobanks

Rieko Sakurai et al. Int J Epidemiol.

Abstract

Background: Biobanks increasingly collect, process and store omics data alongside more conventional epidemiologic information, which necessitates considerable effort in data cleaning. An efficient outlier detection method that reduces manual labour is highly desirable.

Method: We develop an unsupervised machine-learning method for outlier detection, namely kurPCA, that uses principal component analysis combined with kurtosis to ascertain the existence of outliers. In addition, we propose a novel regression adjustment approach to improve detection, namely the regression adjustment for data by systematic missing patterns (RAMP).
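To illustrate the general idea of combining principal component analysis with kurtosis, the following is a minimal sketch, not the authors' kurPCA implementation. It assumes complete numeric data with no constant columns, and the function name `kurtosis_pca_outliers` and both thresholds are hypothetical choices made for this example. It projects standardized data onto principal components, treats components with high excess kurtosis as evidence that outliers exist, and flags rows with extreme scores on those components. RAMP is not sketched here because the abstract does not describe its details.

```python
import numpy as np


def kurtosis_pca_outliers(X, kurtosis_threshold=3.0, score_threshold=4.0):
    """Flag rows with extreme scores on heavy-tailed principal components.

    Illustrative sketch only; thresholds are assumptions, not published values.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column (assumes no column is constant).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via singular value decomposition; scores are the projections.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U * s
    # Drop degenerate components, then scale scores to unit variance.
    std = scores.std(axis=0)
    keep = std > 1e-12
    scores = scores[:, keep] / std[keep]
    # Scores have mean zero, so excess kurtosis is mean(z^4) - 3.
    excess_kurtosis = (scores ** 4).mean(axis=0) - 3.0
    # Heavy-tailed components suggest that outliers are present.
    heavy = excess_kurtosis > kurtosis_threshold
    # Flag rows that are extreme on any heavy-tailed component.
    flagged = (np.abs(scores[:, heavy]) > score_threshold).any(axis=1)
    return flagged, excess_kurtosis


# Synthetic demonstration: Gaussian noise plus one planted aberrant record.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
X_demo[0] = [12.0, -12.0, 12.0, -12.0, 12.0]
flagged, excess_kurtosis = kurtosis_pca_outliers(X_demo)
print("flagged rows:", np.flatnonzero(flagged))
```

In this synthetic example the planted record dominates one component, whose kurtosis rises well above the Gaussian baseline. Real questionnaire data would also need handling of missing values and categorical items, which this sketch leaves out.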

Result: Application to epidemiological record data in a large-scale biobank (Tohoku Medical Megabank Organization, Japan) shows that a combination of kurPCA and RAMP effectively detects known errors or inconsistent patterns.

Conclusions: The simulation and the biobank application both indicate that the proposed methods perform well. The methods are useful in many practical analysis scenarios.

Keywords: Outlier detection; anomaly detection; kurtosis; principal component analysis; regression adjustment.
