Data Brief. 2019 May 24;25:104004.
doi: 10.1016/j.dib.2019.104004. eCollection 2019 Aug.

Descriptive statistics and visualization of data from the R datasets package with implications for clusterability

Naomi C Brownstein et al. Data Brief. 2019.

Abstract

This manuscript describes and visualizes datasets from the datasets package in the R statistical software, focusing on descriptive statistics and visualizations that provide insight into the clusterability of these datasets. The datasets are publicly available as part of the R software system, which can be downloaded at https://www.r-project.org/, with documentation provided at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html. Further information on clusterability is found in the companion to this article, "To cluster, or not to cluster: an analysis of clusterability methods" (https://doi.org/10.1016/j.patcog.2018.10.026). Brief descriptions of the variables in each dataset are provided, summarized by means, extrema, quartiles, standard deviations, and standard errors. Two-dimensional plots for each pair of variables are provided. Original references to the datasets are included when available. Further, each dataset is reduced to a single dimension by each of two methods: pairwise distances and principal component analysis. For the latter, only the first component is used. Histograms of the reduced data are included for every dataset using both methods.

Keywords: Datasets; Dimension reduction; Histograms; Pairwise distances; Principal component analysis.
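
The summaries and the two one-dimensional reductions named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the original analysis uses R, whereas the sketch below uses only the Python standard library. The function names (describe, pairwise_distances, first_pc_scores), the toy data, the choice of Euclidean distance, the "inclusive" quartile rule, and the use of power iteration on the unscaled covariance matrix are all assumptions made for illustration.

```python
import math
import statistics
from itertools import combinations

# Toy data with made-up values (NOT taken from any R dataset).
# Each inner list is one observation; each position is one variable.
toy_rows = [
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.0],
    [5.0, 10.0],
]


def describe(column):
    """Mean, extrema, quartiles, standard deviation and standard error."""
    n = len(column)
    sd = statistics.stdev(column)  # sample SD (n - 1 denominator)
    # Assumption: the "inclusive" interpolation rule is used for quartiles.
    q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(column),
        "min": min(column),
        "max": max(column),
        "q1": q1,
        "median": median,
        "q3": q3,
        "sd": sd,
        "se": sd / math.sqrt(n),  # standard error of the mean
    }


def pairwise_distances(rows):
    """Euclidean distance for every unordered pair of observations."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]


def first_pc_scores(rows, iterations=500):
    """Scores on the first principal component.

    Uses power iteration on the sample covariance matrix (no scaling).
    Assumes the data are not constant and that the all-ones starting
    vector is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [statistics.fmean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered observation onto the leading direction.
    return [sum(x * c for x, c in zip(row, v)) for row in centered]


if __name__ == "__main__":
    for index, column in enumerate(zip(*toy_rows)):
        print("variable", index, describe(list(column)))
    print("pairwise distances:", pairwise_distances(toy_rows))
    print("first PC scores:", first_pc_scores(toy_rows))
```

Either one-dimensional output (the list of pairwise distances or the list of first-component scores) is what would be passed to a histogram routine. Note that n observations yield n(n - 1)/2 distances but only n component scores, so the two histograms summarize different numbers of values.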

Figures

Fig. 1
Plot of Faithful Data. Waiting time between eruptions vs. eruption duration, both measured in minutes.
Fig. 2
Two-dimensional projections of the iris dataset. The first four variables are measured in centimeters. The fifth variable indicates which of three species the observation belongs to.
Fig. 3
Plot of Rivers Data. Lengths are measured in miles.
Fig. 4
Pairwise plot of variables in the swiss data. Fertility is a standardized measure. All other variables are proportions of the population falling into a certain category: agricultural job, high performance on the army exam, educational attainment past primary school, Catholic religion membership, and infant mortality.
Fig. 5
Plots of Attitude Data. Responses correspond to the percentage of favorable responses within a department on the corresponding topic.
Fig. 6
Plot of Cars Data: Stopping distance vs. speed. Stopping distance was measured in feet, and speed was measured in miles per hour.
Fig. 7
Plot for the Trees Data. Girth is measured in inches, height in feet, and volume in cubic feet.
Fig. 8
Two-dimensional plot for USJudgeRatings data. The plots include pairwise plots of twelve ratings by lawyers for judges from the U.S. Superior Court. Ratings: 1) number of contacts 2) judicial integrity 3) demeanor 4) diligence 5) case flow 6) prompt decisions 7) preparation for trial 8) familiarity with law 9) sound oral rulings 10) sound written rulings 11) physical ability 12) worthiness of retention.
Fig. 9
Two-dimensional projections of the USArrests data. Murder, Assault, and Rape refer to the count of arrests per one-hundred thousand residents. Urban population is the proportion of the population within the state living in an urban area.
Fig. 10
Projections for Faithful. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 11
Projections for Iris. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 12
Rivers: Distances.
Fig. 13
Projections for Swiss. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 14
Projections for Attitude. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 15
Projections for Cars. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 16
Projections for Trees. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 17
Projections for USJudgeRatings. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).
Fig. 18
Projections for USArrests. The top row (a) includes histograms of the pairwise dissimilarities. The bottom row (b) includes histograms of the first principal component (PCA).

References

    1. Adolfsson A., Ackerman M., Brownstein N.C. To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recogn. 2018;88:13–26. doi: 10.1016/j.patcog.2018.10.026.
    2. Azzalini A., Bowman A.W. A look at some data on the old faithful geyser. Appl. Stat. 1990:357–365.
    3. Chatterjee S., Price B. Regression Analysis by Example. John Wiley & Sons; 1991.
    4. Ezekiel M. Methods of Correlation Analysis. vol. 427. New York and London; 1930.
    5. Fisher R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936;7(2):179–188.
