BMC Med Inform Decis Mak. 2017 Apr 13;17(1):40.

doi: 10.1186/s12911-017-0429-1.

Automatic identification of variables in epidemiological datasets using logic regression

Collaborators

PROG-IMT study group:
Giuseppe D Norata, Jean Philippe Empana, Hung-Ju Lin, Stela McLachlan, Lena Bokemark, Kimmo Ronkainen, Mauro Amato, Ulf Schminke, Sathanur R Srinivasan, Lars Lind, Akihiko Kato, Chrystosomos Dimitriadis, Tadeusz Przewlocki, Shuhei Okazaki, C D A Stehouwer, Tatjana Lazarevic, Peter Willeit, David N Yanez, Helmuth Steinmetz, Dirk Sander, Holger Poppert, Moise Desvarieux, M Arfan Ikram, Sebastjan Bevc, Daniel Staub, Cesare R Sirtori, Bernhard Iglseder, Gunnar Engström, Giovanni Tripepi, Oscar Beloqui, Moo-Sik Lee, Alfonsa Friera, Wuxiang Xie, Liliana Grigore, Matthieu Plichart, Ta-Chen Su, Christine Robertson, Caroline Schmidt, Tomi-Pekka Tuomainen, Fabrizio Veglia, Henry Völzke, Giel Nijpels, Aleksandar Jovanovic, Johann Willeit, Ralph L Sacco, Oscar H Franco, Radovan Hojs, Heiko Uthoff, Bo Hedblad, Hyun Woong Park, Carmen Suarez, Dong Zhao, Alberico Catapano, Pierre Ducimetiere, Kuo-Liong Chien, Jackie F Price, Göran Bergström, Jussi Kauhanen, Elena Tremoli, Marcus Dörr, Gerald Berenson, Aikaterini Papagianni, Anna Kablak-Ziembicka, Kazuo Kitagawa, Jaqueline M Dekker, Radojica Stolic, Stefan Kiechl, Joseph F Polak, Matthias Sitzer, Horst Bickel, Tatjana Rundek, Albert Hofman, Robert Ekart, Beat Frauchiger, Samuela Castelnuovo, Maria Rosvall, Carmine Zoccali, Manuel F Landecho, Jang-Ho Bae, Rafael Gabriel, Jing Liu, Damiano Baldassarre, Maryam Kavousi

Affiliations

¹ Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany. Matthias.lorenz@em.uni-frankfurt.de.
² Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt/Main, Germany.
³ Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany.
⁴ IRCCS Multimedica, Milan, Italy.
⁵ Department of Pharmacological and Biomolecular Sciences, University of Milan, Milan, Italy.
⁶ Institute of Clinical Sciences, University of Oslo, Oslo, Norway.
⁷ Department of Cardiology, Oslo University Hospital Ullevål, Oslo, Norway.
⁸ Atherosclerosis Department, Cardiology Research Center, Moscow, Russia.
⁹ University Medical Center Utrecht, Utrecht, The Netherlands.
¹⁰ Department of Epidemiology and Biostatistics, Erasmus Medical Center, Rotterdam, The Netherlands.
¹¹ Department of Neurology, Medical University Innsbruck, Innsbruck, Austria.

PMID: 28407816
PMCID: PMC5390441
DOI: 10.1186/s12911-017-0429-1

Matthias W Lorenz et al. BMC Med Inform Decis Mak. 2017.

Abstract

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important: it allows software to present, for each target variable, a shortlist of candidate source variables from which a user chooses the matching one, with only a low risk that the correct source variable is missing from the list.

Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.
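The idea of searching for a Boolean combination of hand-written rules can be illustrated with a small Python sketch. It is not the authors' implementation: logic regression is replaced here by a plain exhaustive search over single rules and pairwise AND/OR combinations, and the rules, the metadata fields (`name`, `n_distinct`) and the toy construction data are all hypothetical.

```python
from itertools import combinations

# Hypothetical hand-written rules for a target variable "sex". Each rule maps
# the metadata of one source variable to True or False.
RULES = {
    "name_has_sex": lambda v: "sex" in v["name"].lower(),
    "name_has_gender": lambda v: "gender" in v["name"].lower(),
    "two_levels": lambda v: v["n_distinct"] == 2,
}

OPERATORS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
}


def candidates():
    """Yield (description, predicate) for single rules and pairwise combinations."""
    for name, rule in RULES.items():
        yield name, rule
    for (n1, r1), (n2, r2) in combinations(RULES.items(), 2):
        for op_name, op in OPERATORS.items():
            # Default arguments freeze the current rules and operator.
            yield (
                f"({n1} {op_name} {n2})",
                lambda v, r1=r1, r2=r2, op=op: op(r1(v), r2(v)),
            )


def errors(predicate, data):
    """Count source variables that the predicate misclassifies."""
    return sum(predicate(v) != is_match for v, is_match in data)


def best_combination(data):
    """Return the candidate with the fewest errors (exhaustive search, not annealing)."""
    return min(candidates(), key=lambda c: errors(c[1], data))


# Toy construction subset: (source variable metadata, is it really "sex"?)
construction = [
    ({"name": "SEX", "n_distinct": 2}, True),
    ({"name": "gender", "n_distinct": 2}, True),
    ({"name": "smoker", "n_distinct": 2}, False),
    ({"name": "age", "n_distinct": 60}, False),
]

best_name, best_rule = best_combination(construction)
print(best_name, errors(best_rule, construction))
# prints: (name_has_sex OR name_has_gender) 0
```

The selected predicate would then be applied unchanged to a held-out validation subset, mirroring the construction/validation split described above.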

Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample.
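For readers less familiar with these measures, the following Python sketch shows how PPV, NPV, sensitivity and specificity follow from the four cells of a confusion table. The counts are invented for illustration and are not taken from the study.

```python
def predictive_values(tp, fp, tn, fn):
    """Compute standard classification measures from confusion-table counts."""
    return {
        "ppv": tp / (tp + fp),          # share of proposed matches that are correct
        "npv": tn / (tn + fn),          # share of rejected variables correctly rejected
        "sensitivity": tp / (tp + fn),  # share of true matches that were proposed
        "specificity": tn / (tn + fp),  # share of non-matches that were rejected
    }


# Invented counts (assumption, not study data): 27 variables proposed for a
# target, 9 of them correct; 173 rejected, 3 of them wrongly.
measures = predictive_values(tp=9, fp=18, tn=170, fn=3)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

In a semi-automated workflow a low PPV mainly lengthens the shortlist a user has to review, whereas low sensitivity means the correct source variable may never be offered.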

Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

Keywords: Data management; Epidemiology; Logic regression; Meta-analysis.

Figures

**Fig. 1**
Fictitious example of a logic tree combining allocation rules

**Fig. 2**
Sensitivity and specificity as a function of the tuning parameters weights, treesize, minmass and method. At the set point weights = exp(-7), treesize = 8, minmass = 10 for the classification method, the dependency of sensitivity and specificity upon these tuning parameters can be read off this multiple one-dimensional plot. On the x-axis in the leftmost plot, weights are shown as the natural logarithm of the actual values, which effectively vary from 0.0005 = exp(-7.6) to 0.5 = exp(-0.7)

**Fig. 3**
Sweet spot plot for sensitivity and specificity. The same information as in Fig. 2, shown as a two-dimensional contour plot (sweet spot plot) for specificity and sensitivity. For low values of weights and high values of minmass, with treesize = 8 and the classification method, sensitivity can be raised above 99% without lowering specificity below 75%. On the x-axis, weights are again shown as the natural logarithm of the actual values
