. 2015 Feb 25;10(2):e0117898.

doi: 10.1371/journal.pone.0117898. eCollection 2015.

Securely measuring the overlap between private datasets with cryptosets

S Joshua Swamidass¹, Matthew Matlock¹, Leon Rozenblit²

Affiliations

¹ Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.
² Prometheus Research LLC, New Haven, CT, USA.

PMID: 25714898
PMCID: PMC4340911
DOI: 10.1371/journal.pone.0117898

Securely measuring the overlap between private datasets with cryptosets

S Joshua Swamidass et al. PLoS One. 2015.

. 2015 Feb 25;10(2):e0117898.

doi: 10.1371/journal.pone.0117898. eCollection 2015.

Authors

S Joshua Swamidass¹, Matthew Matlock¹, Leon Rozenblit²

Affiliations

¹ Department of Pathology, Washington University School of Medicine, St. Louis, MO, USA.
² Prometheus Research LLC, New Haven, CT, USA.

PMID: 25714898
PMCID: PMC4340911
DOI: 10.1371/journal.pone.0117898

Abstract

Many scientific questions are best approached by sharing data--collected by different groups or across large collaborative networks--into a combined analysis. Unfortunately, some of the most interesting and powerful datasets--like health records, genetic data, and drug discovery data--cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which is not even crackable with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: LR has significant financial interest in Prometheus LLC, a company that builds Open Source software for medical records mining, but is not commercializing the technology presented in this study. The authors have no additional competing interests. This declared interest does not alter the authors’ adherence to PLOS ONE policies on sharing data and materials.

Figures

**Fig 1. Cryptosets are shareable summaries of private data, from which estimates of overlap can be computed.**
They are constructed using a cryptographic hash function to transform private IDs from a dataset into a limited number of public IDs, and then combining these public IDs into a histogram. From this histogram (about 1000 IDs long in practice), the overlap between private datasets can be estimated in a public space. The security of cryptosets relies on the fact that several private IDs map to each public ID. The estimates are based on the Pearson correlation between cryptosets, and can only measure overlap at a predetermined resolution.

**Fig 2. Cryptosets exploit a statistical property of hash functions: they distribute private IDs uniformly across public IDs, assigning several private IDs to the same public ID when truncated.**
Names and birth dates are distributed with discernible patterns in 2010 census data (Panel A). For example, surnames follow a power law distribution, with a few very common names and a long tail of uncommon names. Birth years fall in a comparatively narrow range. Birth days follow a seasonal and weekly cycle. Nonetheless, hashing 250,000 private IDs—of names with dates of birth—generated from these distributions yields uniformly distributed public IDs (Panel B). Moreover, people binned to the same public ID have no real-world connection to one another. Histograms of these public IDs, cryptosets, can be used to estimate the overlap between private datasets (Panel C). These estimates are computed by multiplying the correlation between cryptosets by the geometric mean of their size. P-values, confidence intervals, and other statistical measures of these estimates can be computed. In the figure, an unrealistically small length of 20 is used for clarity.

**Fig 3. Cryptosets can measure the overlap between chemical collections.**
In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The public IDs from the libraries’ scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are both scaffolds unique and common to each library, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.

**Fig 4. Cryptosets stably estimate the overlap proportion between private datasets, no matter the dataset size, and with accuracy tunable by length.**
Each column of figures corresponds to a different number of public IDs: 500, 1000 and 2000. The first row shows the results of an empirical study, demonstrating that the error (the spread of each data series) is stable across all dataset sizes. The second row shows the analytically derived 95% confidence intervals (Equation 9), which closely match the distribution of empirical estimates and are stable across all dataset sizes. Also evident in these figures is that estimate accuracy is tuned by the length (the number of possible public IDs) of the cryptosets.

**Fig 5. Information gain of cryptosets (y-axis) as a function of number of public IDs (the length) and the number of elements in the cryptoset (A on the x-axis).**
Our estimates of the information gain are very close to simulations. The curve almost perfectly fits the data with 5 summation terms. A key finding of this study that is evident in this graph is that the information content in cryptosets rapidly approaches zero as the number of items is increased. This makes sense, because as more items are added the distribution of public IDs approaches a uniform, non-informative distribution.

**Fig 6. Bloom filter overlap estimates are unstable, with sharp dependence on dataset size.**
Bloom filters with one (top row) or two (middle row) hash functions can estimate the overlap between two datasets. However, as the dataset size grows along the x-axis, the estimate error gradually grows and then the estimates suddenly become uncomputable. Once the dataset size becomes too large, the chance of fully saturating the filters sharply increases, causing a sudden, catastrophic increase in error. At saturation, Bloom filters cannot estimate overlap with any accuracy and are plotted as ‘NaN’ values in the graphs. In contrast, cryptoset accuracy (bottom row) is rock-solid stable, and not dependent on dataset size.

**Fig 7. Cryptosets provide a better tradeoff between estimate error and security risk.**
The figures plot the average over 50 simulations of the overlap estimate error (on a log scale) against average information gain (in bits) for cryptosets and Bloom filters over a large range of lengths (ranging from 500 to 10000). The ideal method would have both error and information gain close to zero. The left column of figures uses datasets with 100 items, while the right column of uses datasets with 1000 items. The top row simulates these datasets with an overlap of 40%, while the bottom row simulates with an overlap of 80%. Cryptosets are consistently more accurate and secure than Bloom filters, and this trend is most pronounced in large datasets with lower overlaps (top right).

See this image and copyright information in PMC

References

1. Mervis J (2012) Agencies rally to tackle big data. Science 336: 22–22. 10.1126/science.336.6077.22 - DOI - PubMed
1. Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms In: Aggarwal CC, Yu PS, Elmagarmid AK, editors, Privacy-Preserving Data Mining, Springer; US, volume 34 of Advances in Database Systems pp. 11–52. 10.1007/978-0-387-70992-5_2 - DOI
1. Karakasidis A, Verykios VS (2011) Secure blocking+ secure matching = secure record linkage. J of Comp Science and Engineering 5: 101–106. 10.5626/JCSE.2011.5.3.223 - DOI
1. Johnson SB, Whitney G, McAuliffe M,Wang H, McCreedy E, et al. (2010) Using global unique identifiers to link autism collections. Journal of the American Medical Informatics Association 17: 689–695. 10.1136/jamia.2009.002063 - DOI - PMC - PubMed
1. Kuzu M, Kantarcioglu M, Durham E, Malin B (2011) A constraint satisfaction cryptanalysis of bloom filters in private record linkage In: Privacy Enhancing Technologies. Springer, pp. 226–245. 10.1007/978-3-642-22263-4_13 - DOI

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Securely measuring the overlap between private datasets with cryptosets

Affiliations

Securely measuring the overlap between private datasets with cryptosets

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources